https://dvc.orgGatsbyJSFri, 18 Jul 2025 03:38:11 GMThttps://dvc.org/blog/dvc-ray-part-2https://dvc.org/blog/dvc-ray-part-2Wed, 13 Mar 2024 00:00:00 GMT<p>In <a href="https://dvc.ai/blog/dvc-ray" target="_blank" rel="nofollow noopener noreferrer">Part 1</a> of the tutorial, we explored the basics
of setting up and integrating DVC with Ray for distributed machine learning
workflows. Combining Ray's distributed computing capabilities with DVC's data
version control gives us a robust framework for managing complex ML
experiments, with better scalability, reproducibility, and collaboration in ML
projects.</p>
<p>In Part 2, we extend the solution to a Ray Cluster on AWS, demonstrating how to
adapt the setup for cloud-based distributed computing. This part involves
configuring AWS resources, deploying Ray clusters in the cloud, and running
DVC-managed pipelines at scale.</p>
<blockquote>
<p>We would like to express our gratitude to
<a href="https://www.linkedin.com/in/schuh/" target="_blank" rel="nofollow noopener noreferrer">Andreas Schuh</a> from
<a href="https://www.heartflow.com/" target="_blank" rel="nofollow noopener noreferrer">HeartFlow</a> for his contribution to this solution
and for providing ideas and feedback for the blog posts. 🤝</p>
</blockquote>
<details>
<summary>Table of Contents</summary>
<ul>
<li><a href="#%EF%B8%8Fdesign-scalable-ml-experiments-with-dvc-and-ray">🛠️ Design Scalable ML Experiments with DVC and Ray</a>
<ul>
<li><a href="#1---technical-challenges-of-running-dvc-in-a-distributed-ray-cluster">1 - Technical challenges of running DVC in a distributed Ray Cluster</a></li>
<li><a href="#2---overview-of-the-solution-design">2 - Overview of the Solution Design</a></li>
<li><a href="#3---discuss-the-solution-design">3 - Discuss the solution design</a>
<ul>
<li><a href="#%EF%B8%8Fuse-a-modified-dvclive-logger-to-upload-metrics-to-the-s3">☝️ Use a modified DVCLive logger to upload metrics to the S3</a></li>
<li><a href="#%EF%B8%8Fdownload-dvclive-metrics-to-the-dvc-repository-after-the-training-is-complete">☝️ Download DVCLive metrics to the DVC repository after the training is complete</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#set-up-and-run-dvc-in-distributed-ray-cluster">🚀 Set Up and Run DVC in Distributed Ray Cluster</a>
<ul>
<li><a href="#1---prepare-aws-and-dvc-studio-credentials">1 - Prepare <strong>AWS and DVC Studio credentials</strong></a></li>
<li><a href="#2---configure-ray-cluster-in-clusteryaml">2 - Configure Ray Cluster in <code>cluster.yaml</code></a>
<ul>
<li><a href="#set-the-cluster-name-and-auto-scaling-config">Set the cluster name and auto-scaling config</a></li>
<li><a href="#set-up-the-docker-image-for-the-head-and-worker-nodes">Set up the Docker image for the head and worker nodes</a></li>
<li><a href="#cloud-provider-configuration">Cloud-provider configuration</a></li>
<li><a href="#files-or-directories-to-copy-to-the-head-and-worker-nodes">Files or directories to copy to the head and worker nodes</a></li>
<li><a href="#additional-commands-to-set-up-nodes">Additional commands to set up nodes</a></li>
</ul>
</li>
<li><a href="#3---start-a-ray-cluster-on-aws">3 - Start a Ray Cluster on AWS</a></li>
<li><a href="#4---connect-to-the-head-node-and-set-up-credentials">4 - Connect to the Head Node and Set Up Credentials</a>
<ul>
<li><a href="#connecting-to-the-cluster">Connecting to the Cluster</a></li>
<li><a href="#setting-up-git-credentials">Setting Up Git Credentials</a></li>
<li><a href="#run-tests-to-check-the-correct-setup">Run tests to check the correct setup</a></li>
</ul>
</li>
<li><a href="#5---run-dvc-pipelines-on-the-remote-ray-cluster">5 - Run DVC Pipelines on the remote Ray Cluster</a></li>
<li><a href="#6---commit--push-experiments">6 - Commit & push experiments</a></li>
<li><a href="#7---stop-cluster">7 - Stop Cluster</a></li>
</ul>
</li>
<li><a href="#-summing-up-dvc--ray-integration">🎨 Summing Up: DVC + Ray Integration</a></li>
<li><a href="#references">References</a></li>
</ul>
</details>
<h2 id="️design-scalable-ml-experiments-with-dvc-and-ray" style="position:relative;">🛠️ Design Scalable ML Experiments with DVC and Ray<a href="#%EF%B8%8Fdesign-scalable-ml-experiments-with-dvc-and-ray" aria-label="️design scalable ml experiments with dvc and ray permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Moving from a local setup to a multi-node Ray Cluster on AWS introduces
challenges that a single machine does not have: jobs run on remote workers,
nodes are added and removed by auto-scaling, and results must be brought back
to the DVC repository. This section reviews these challenges and presents a
design that integrates DVC and Ray in a distributed environment.</p>
<p><strong>Goals for this section:</strong></p>
<ul>
<li>Identify and address the technical challenges of running DVC in a distributed
Ray cluster.</li>
<li>Design an efficient and scalable integration of DVC and Ray in a distributed
environment.</li>
<li>Propose solutions and best practices for overcoming these challenges.</li>
</ul>
<h3 id="1---technical-challenges-of-running-dvc-in-a-distributed-ray-cluster" style="position:relative;">1 - Technical challenges of running DVC in a distributed Ray Cluster<a href="#1---technical-challenges-of-running-dvc-in-a-distributed-ray-cluster" aria-label="1 technical challenges of running dvc in a distributed ray cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Let’s outline the scope of the target solution for the following discussion:</p>
<ul>
<li>A Ray Cluster can add more worker nodes (auto-scaling) on AWS EC2.</li>
<li>All jobs are executed only on worker nodes (not on the head node) in Docker
containers.</li>
<li>The user runs DVC pipelines and commits results on the head node (connected by
SSH).</li>
<li>During training, the user should be able to track metrics as they update
in live mode.</li>
<li>Data and models are stored in AWS S3.</li>
<li>Code and metadata are versioned with Git.</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/80aeeefacbcd205d9aa91fe53aa52a15/39600/2-challenges.png" alt="Challenges" title="Challenges" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Challenges
of running DVC in a distributed Ray Cluster</em></p>
<p>Let's review each challenge and its proposed solution:</p>
<ol>
<li><strong>Auto-Scaling Worker Nodes</strong>:
<ul>
<li>Challenge: Ensuring seamless integration with Ray's auto-scaling feature to
dynamically add or remove worker nodes based on workload demand.</li>
<li>Solution: Utilize Ray's built-in auto-scaling functionality, which allows
for the dynamic addition and removal of worker nodes as needed.</li>
</ul>
</li>
<li><strong>Execution on Worker Nodes Only</strong>:
<ul>
<li>Challenge: Ensuring that all jobs, including DVC pipelines and Ray tasks,
are executed exclusively on worker nodes to optimize resource utilization.
One specific requirement is to propagate DVC environment variables to
all worker nodes.</li>
<li>Solution: Configure the Ray cluster to execute all tasks and jobs
exclusively on worker nodes. Monitor the head node's load and use Ray's
capabilities to distribute tasks evenly across the worker nodes.</li>
</ul>
</li>
<li><strong>Live Metrics Tracking During Training</strong>:
<ul>
<li>Challenge: Tracking real-time metrics during model training on distributed
worker nodes with <a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a>.</li>
<li>Solution: Use DVCLive, a lightweight library compatible with DVC, to track
real-time metrics during training sessions. Set up the pipeline to use
DVCLive on the rank 0 worker only, so that metrics are not logged more than once. Ensure that
DVCLive, running on the rank 0 worker, has access to the
<a href="https://dvc.org/doc/user-guide/env" target="_blank" rel="nofollow noopener noreferrer">DVC environment variables</a>, including
<code>DVC_STUDIO_TOKEN</code>, to log metrics to DVC Studio.</li>
</ul>
</li>
<li><strong>Synchronize DVC Pipeline Artifacts with the Head Node</strong>:
<ul>
<li>Challenge: Ensuring that artifacts generated by DVC pipelines on worker
nodes are consistently and efficiently synchronized back to the head node,
where they can be versioned and committed to Git and DVC remote storage.</li>
<li>Solution: Sync artifacts in two steps:
<ul>
<li><strong>From Worker to S3</strong>: Set up Ray to use an AWS S3 bucket as a persistent
storage to sync artifacts and checkpoints.</li>
<li><strong>From S3 to Head Node</strong>: After the distributed pipeline is complete,
pull the required artifacts and a model from the persistent storage on S3
to the project repository on the head node.</li>
</ul>
</li>
</ul>
</li>
</ol>
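<p>The environment-variable requirement from item 2 can be sketched as follows.
This is a minimal illustration, not the tutorial's actual code: the helper name
<code>collect_dvc_env</code> and the <code>DVC_</code> prefix filter are assumptions. The
resulting dictionary has the shape that Ray accepts as a <code>runtime_env</code>, which
is one way to make variables such as <code>DVC_STUDIO_TOKEN</code> visible to tasks on
worker nodes.</p>

```python
import os
from typing import Dict, Mapping, Optional


def collect_dvc_env(
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Dict[str, str]]:
    """Build a Ray-style runtime_env dict from DVC-related variables.

    Hypothetical helper: it forwards every variable whose name starts
    with "DVC_" (for example DVC_STUDIO_TOKEN) and nothing else.
    """
    source = os.environ if environ is None else environ
    env_vars = {k: v for k, v in source.items() if k.startswith("DVC_")}
    return {"env_vars": env_vars}


# Example with an explicit mapping instead of the real process environment.
runtime_env = collect_dvc_env({"DVC_STUDIO_TOKEN": "isat_example", "HOME": "/root"})
print(runtime_env)
# On the head node, this could then be passed as ray.init(runtime_env=runtime_env).
```

<p>Filtering by prefix keeps unrelated secrets from the head node's environment
out of the worker processes.</p>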
<h3 id="2---overview-of-the-solution-design" style="position:relative;">2 - Overview of the Solution Design<a href="#2---overview-of-the-solution-design" aria-label="2 overview of the solution design permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Here is a diagram that depicts the proposed solution:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ab0b797760f194e14bb4046eb98dff40/39600/3-solution-design-2.png" alt="Solution Design for DVC with Ray in Clouds" title="Solution Design for DVC with Ray in Clouds" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Solution Design for DVC with Ray in Clouds</em></p>
<p>The diagram illustrates the integration of DVC (Data Version Control) and Ray
in a cloud-based environment running on AWS EC2 instances. Let's break down
its key components and steps.</p>
<ol>
<li>Package project & Provision Ray Cluster: Provision the Ray cluster on AWS
EC2 instances before running experiments. There are a few ways to deliver the project to the nodes:
<ul>
<li>Set up <code>cluster.yaml</code> to copy files and directories from the local machine
to the head and worker nodes.</li>
<li>Pull the code and dependencies from the Git repository or S3 bucket.</li>
</ul>
</li>
<li>Run <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a>: Connect to the Ray cluster (head node), navigate to the
project directory, and run <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a>. The head node coordinates tasks,
manages resources, and initiates the execution of parallel tasks on worker
nodes.</li>
<li>Publish Live Metrics to Studio:
<ul>
<li>During the execution of <code>train.py</code>, DVCLive handles logging metrics and
parameters at Worker(rank=0) to avoid duplication.</li>
<li>DVC Studio visualizes metric updates in live mode.</li>
</ul>
</li>
<li>Push DVCLive logs from a Worker Node to S3: The current version of
DVCLive logs metrics and artifacts to the filesystem of the rank 0 worker. To
make them available in the project repository on the head node after the
experiment is complete, a few modifications were made:
<ul>
<li>Use <code>DVCLiveRayLogger</code> as <a href="https://dvc.org/doc/dvclive/live" target="_blank" rel="nofollow noopener noreferrer">Live</a>. It
extends <code>Live</code> with the ability to store metrics in S3.</li>
<li>The modified <code>Live.next_step()</code> uploads the <code>results/dvclive</code>
directory to the S3 bucket (<code>s3://cse-cloud-version/tutorial-mnist-dvc-ray/</code>) every
epoch.</li>
</ul>
</li>
<li>Pull DVCLive logs from S3 to the Head Node after completing the experiment.</li>
<li>Commit & Push the DVC experiment artifacts and metadata updates.</li>
</ol>
<h3 id="3---discuss-the-solution-design" style="position:relative;">3 - Discuss the solution design<a href="#3---discuss-the-solution-design" aria-label="3 discuss the solution design permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Let’s summarize the changes made to the scripts so that they run on a
distributed Ray cluster in the cloud:</p>
<ul>
<li>Use a modified DVCLive logger to upload metrics to the S3 bucket every epoch.</li>
<li>Download DVCLive metrics to the DVC repository after the training is complete.</li>
</ul>
<h4 id="️use-a-modified-dvclive-logger-to-upload-metrics-to-the-s3" style="position:relative;">☝️ Use a modified DVCLive logger to upload metrics to the S3<a href="#%EF%B8%8Fuse-a-modified-dvclive-logger-to-upload-metrics-to-the-s3" aria-label="️use a modified dvclive logger to upload metrics to the s3 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h4>
<p>The modified <code>DVCLiveRayLogger</code> inherits from <code>Live</code> and adds the ability to
push DVCLive metrics directly to an S3 bucket. This is necessary because the
training code is executed on remote workers, so DVCLive can’t write metrics and
artifacts directly to the DVC repository on the head node.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive <span class="token keyword">import</span> Live

<span class="token comment"># upload_to_s3() is defined elsewhere in the project (not shown here)</span>
<span class="token keyword">class</span> <span class="token class-name">DVCLiveRayLogger</span><span class="token punctuation">(</span>Live<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">def</span> <span class="token function">__init__</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> bucket_name<span class="token punctuation">,</span> s3_directory<span class="token punctuation">,</span> <span class="token operator">*</span>args<span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token builtin">super</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>__init__<span class="token punctuation">(</span><span class="token operator">*</span>args<span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span class="token punctuation">)</span>
        self<span class="token punctuation">.</span>bucket_name <span class="token operator">=</span> bucket_name
        self<span class="token punctuation">.</span>s3_directory <span class="token operator">=</span> s3_directory

    <span class="token keyword">def</span> <span class="token function">next_step</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> <span class="token operator">*</span>args<span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token builtin">super</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>next_step<span class="token punctuation">(</span><span class="token operator">*</span>args<span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span class="token punctuation">)</span>
        <span class="token comment"># Push the local DVCLive directory to S3 after every step</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"\nDVCLiveLogger: Push DVCLive metrics to S3"</span><span class="token punctuation">)</span>
        upload_to_s3<span class="token punctuation">(</span>self<span class="token punctuation">.</span><span class="token builtin">dir</span><span class="token punctuation">,</span> self<span class="token punctuation">.</span>bucket_name<span class="token punctuation">,</span> self<span class="token punctuation">.</span>s3_directory<span class="token punctuation">)</span></code></pre></div>
<ul>
<li>By pushing the DVCLive directory to S3, teams can share, access, and
analyze training progress from anywhere without relying on local file systems.</li>
</ul>
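<p>The post does not show the body of <code>upload_to_s3()</code>. A minimal sketch of what
such a helper might look like is given below. The implementation, the optional
<code>client</code> parameter, and the returned list of keys are assumptions made for
illustration; only the first three parameters match the call in
<code>DVCLiveRayLogger</code>. It relies on the <code>upload_file</code> method of the boto3 S3
client.</p>

```python
from pathlib import Path
from typing import Any, List, Optional


def upload_to_s3(
    local_dir: str,
    bucket_name: str,
    s3_directory: str,
    client: Optional[Any] = None,
) -> List[str]:
    """Upload every file under local_dir to s3://bucket_name/s3_directory/.

    Assumed implementation; returns the object keys that were written.
    """
    if client is None:
        import boto3  # imported lazily so the function can be tested with a stub

        client = boto3.client("s3")

    root = Path(local_dir)
    keys = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Keep the layout of the local directory below the S3 prefix.
        key = f"{s3_directory.rstrip('/')}/{path.relative_to(root).as_posix()}"
        client.upload_file(str(path), bucket_name, key)
        keys.append(key)
    return keys
```

<p>Re-uploading the whole directory every epoch is simple and works for small
metric files. For large artifacts, uploading only changed files would be a
reasonable refinement.</p>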
<h4 id="️download-dvclive-metrics-to-the-dvc-repository-after-the-training-is-complete" style="position:relative;">☝️ Download DVCLive metrics to the DVC repository after the training is complete<a href="#%EF%B8%8Fdownload-dvclive-metrics-to-the-dvc-repository-after-the-training-is-complete" aria-label="️download dvclive metrics to the dvc repository after the training is complete permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h4>
<p>A <code>Live</code> object created from <code>DVCLiveRayLogger</code> behaves the same way as the
original DVCLive object. Only the configuration changes:</p>
<ul>
<li>Set <code>dir="results/dvclive"</code> so that DVC correctly resolves the paths of
logged metrics and artifacts after the training.</li>
<li>Set <code>bucket_name</code> and <code>s3_directory</code> to save live metrics and artifacts in S3.</li>
</ul>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">train_func_per_worker</span><span class="token punctuation">(</span>config<span class="token punctuation">:</span> Dict<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
    <span class="token comment"># [3] Set up Live object for DVCLive</span>
    live <span class="token operator">=</span> <span class="token boolean">None</span>
    <span class="token keyword">if</span> worker_rank <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">:</span>
        <span class="token comment"># Initialize DVC Live</span>
        <span class="token keyword">from</span> src<span class="token punctuation">.</span>live <span class="token keyword">import</span> DVCLiveRayLogger <span class="token keyword">as</span> Live
        live <span class="token operator">=</span> Live<span class="token punctuation">(</span>
            <span class="token builtin">dir</span><span class="token operator">=</span><span class="token string">"results/dvclive"</span><span class="token punctuation">,</span>
            save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">,</span>
            bucket_name<span class="token operator">=</span><span class="token string">"cse-cloud-version"</span><span class="token punctuation">,</span>
            s3_directory<span class="token operator">=</span><span class="token string">"tutorial-mnist-dvc-ray/dvclive"</span><span class="token punctuation">,</span>
        <span class="token punctuation">)</span>

<span class="token keyword">def</span> <span class="token function">train</span><span class="token punctuation">(</span>params<span class="token punctuation">:</span> <span class="token builtin">dict</span><span class="token punctuation">)</span> <span class="token operator">-</span><span class="token operator">></span> <span class="token boolean">None</span><span class="token punctuation">:</span>
    <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
    <span class="token comment"># Pull DVCLive logs from S3 (same bucket and prefix as in the logger above)</span>
    bucket_name <span class="token operator">=</span> <span class="token string">"cse-cloud-version"</span>
    s3_directory <span class="token operator">=</span> <span class="token string">"tutorial-mnist-dvc-ray/dvclive"</span>
    download_from_s3<span class="token punctuation">(</span>bucket_name<span class="token punctuation">,</span> s3_directory<span class="token punctuation">,</span> <span class="token string">'results/dvclive/'</span><span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">"__main__"</span><span class="token punctuation">:</span>
    <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
    train<span class="token punctuation">(</span>params<span class="token punctuation">)</span></code></pre></div>
<ul>
<li>At every training epoch, <code>live.next_step()</code> pushes the <code>results/dvclive</code>
directory to the S3 bucket.</li>
<li>After the training, <code>download_from_s3()</code> downloads the DVCLive metrics to
the <code>results/dvclive/</code> directory in the DVC repository.</li>
</ul>
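<p>The body of <code>download_from_s3()</code> is not shown in the post. The sketch below
is one possible implementation under stated assumptions: the optional <code>client</code>
parameter and the returned list of paths are hypothetical additions, and the
function relies on the <code>list_objects_v2</code> paginator and the <code>download_file</code>
method of the boto3 S3 client.</p>

```python
import os
from typing import Any, List, Optional


def download_from_s3(
    bucket_name: str,
    s3_directory: str,
    local_dir: str,
    client: Optional[Any] = None,
) -> List[str]:
    """Download all objects under the S3 prefix into local_dir (assumed implementation)."""
    if client is None:
        import boto3  # imported lazily so the function can be tested with a stub

        client = boto3.client("s3")

    prefix = s3_directory.rstrip("/") + "/"
    written = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for obj in page.get("Contents", []):
            relative = obj["Key"][len(prefix):]
            if not relative or relative.endswith("/"):
                continue  # skip "directory" placeholder objects
            target = os.path.join(local_dir, *relative.split("/"))
            os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
            client.download_file(bucket_name, obj["Key"], target)
            written.append(target)
    return written
```

<p>Using a paginator matters once the prefix holds more objects than a single
listing call returns, for example when many plots or checkpoints are logged.</p>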
<h2 id="set-up-and-run-dvc-in-distributed-ray-cluster" style="position:relative;">🚀 Set Up and Run DVC in Distributed Ray Cluster<a href="#set-up-and-run-dvc-in-distributed-ray-cluster" aria-label="set up and run dvc in distributed ray cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<blockquote>
<p>💡 Note: Navigate to the <code>cloud</code> branch in the repository</p>
</blockquote>
<p>This section of the tutorial provides a step-by-step guide on how to set up and
run a DVC pipeline on a Ray cluster hosted on AWS. The integration of DVC with
Ray on AWS allows for scaling machine learning workflows, leveraging cloud
resources for distributed processing.</p>
<p><strong>Goals for this section:</strong></p>
<ul>
<li>Guide you through the steps to set up and run the example on a Ray cluster
hosted on AWS.</li>
<li>Explain specific solutions and best practices.</li>
</ul>
<h3 id="1---prepare-aws-and-dvc-studio-credentials" style="position:relative;">1 - Prepare <strong>AWS and DVC Studio credentials</strong><a href="#1---prepare-aws-and-dvc-studio-credentials" aria-label="1 prepare aws and dvc studio credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This example uses a simple AWS access configuration. Prepare AWS credentials for
use with Ray (or any other application that requires AWS access) and store them
in a specific file (<code>~/.aws/ray-credentials</code>) on a local machine. In the next
step, you’ll configure Ray to use this file.</p>
<p>For example, use the following CLI script to store AWS secrets in
<code>~/.aws/ray-credentials</code>:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token builtin class-name">echo</span> <span class="token string">"[default]
aws_access_key_id = ASIAU7...
aws_secret_access_key = Fdpgl...
aws_session_token = IQoJb3JpZ...
"</span> <span class="token operator">></span> ~/.aws/ray-credentials</code></pre></div>
<p>To track metrics with DVC Studio, save
your <a href="https://dvc.org/doc/studio/user-guide/account-and-billing#client-access-tokens" target="_blank" rel="nofollow noopener noreferrer">DVC Studio client access token</a> to
the <code>.dvc/config.local</code> file. Neither Git nor DVC tracks this file. In the next
step, you’ll configure Ray to use this file to provision the head and worker
nodes.</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">dvc config <span class="token parameter variable">--local</span> studio.token isat_2BlrAu0aileSH<span class="token punctuation">..</span>.</code></pre></div>
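<p>The <code>echo</code> command for the AWS credentials file earlier in this step assumes
that the <code>~/.aws</code> directory already exists and leaves the file with default
permissions. As an alternative, here is a minimal Python sketch that writes the
same <code>[default]</code> profile, creates the directory if needed, and restricts the
file to its owner. The function name <code>write_ray_credentials</code> is hypothetical,
and the credential values are placeholders you would supply yourself.</p>

```python
import configparser
import os
from pathlib import Path


def write_ray_credentials(path, access_key_id, secret_access_key, session_token=None):
    """Write a [default] AWS profile to path, readable by the owner only."""
    # interpolation=None so that "%" characters in secrets are stored verbatim.
    config = configparser.ConfigParser(interpolation=None)
    config["default"] = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    if session_token:
        config["default"]["aws_session_token"] = session_token

    target = Path(path).expanduser()
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "w") as handle:
        config.write(handle)
    os.chmod(target, 0o600)  # restrict access to the file owner
    return target
```

<p>A call such as <code>write_ray_credentials("~/.aws/ray-credentials", key_id, secret, token)</code>
would then produce the file that the cluster configuration refers to.</p>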
<h3 id="2---configure-ray-cluster-in-clusteryaml" style="position:relative;">2 - Configure Ray Cluster in <code>cluster.yaml</code><a href="#2---configure-ray-cluster-in-clusteryaml" aria-label="2 configure ray cluster in clusteryaml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>To start a Ray cluster on AWS, you will use a configuration file named
<code>cluster.yaml</code>, which describes your AWS setup, including instance types, the
number of nodes, and other settings. The file is long and heavily commented,
so let’s highlight only the parts specific to the current solution
design.</p>
<h4 id="set-the-cluster-name-and-auto-scaling-config" style="position:relative;">Set the cluster name and auto-scaling config<a href="#set-the-cluster-name-and-auto-scaling-config" aria-label="set the cluster name and auto scaling config permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h4>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">cluster_name</span><span class="token punctuation">:</span> tutorial<span class="token punctuation">-</span>mnist<span class="token punctuation">-</span>dvc<span class="token punctuation">-</span>ray
<span class="token key atrule">max_workers</span><span class="token punctuation">:</span> <span class="token number">2</span>
<span class="token key atrule">upscaling_speed</span><span class="token punctuation">:</span> <span class="token number">1.0</span></code></pre></div>
<ul>
<li>
<p>In the Ray cluster configuration for the <code>tutorial-mnist-dvc-ray</code> cluster, the
<code>cluster_name</code> specifies a unique identifier for the cluster, distinguishing
it from other clusters you might be running. This name is used in managing and
tracking the cluster's resources.</p>
</li>
<li>
<p>The <code>max_workers</code> setting defines the maximum number of worker nodes the
cluster can scale up to in addition to the head node. It's set to <code>2</code> here,
meaning the cluster can run up to two worker nodes concurrently to process
tasks.</p>
</li>
<li>
<p>The <code>upscaling_speed</code> parameter controls how quickly the cluster can scale up
by adding more worker nodes when there's an increase in load or tasks. With a
value of <code>1.0</code>, the autoscaler can increase the cluster size by up to 100% of
the currently running nodes in each scaling operation.</p>
</li>
</ul>
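<p>To make the interaction between <code>max_workers</code> and <code>upscaling_speed</code> concrete,
here is a small arithmetic sketch. It is a simplified model of the rule
described above, not Ray's actual autoscaler code: the function name
<code>workers_to_add</code>, the minimum batch of one node, and counting the head node
among the running nodes are all assumptions.</p>

```python
def workers_to_add(
    running_nodes: int,
    current_workers: int,
    max_workers: int = 2,
    upscaling_speed: float = 1.0,
) -> int:
    """Simplified model of one scale-up round (not Ray's actual autoscaler code).

    The batch is upscaling_speed * running_nodes (at least one node),
    capped by the remaining room below max_workers.
    """
    batch = max(1, int(upscaling_speed * running_nodes))
    room = max(0, max_workers - current_workers)
    return min(batch, room)


# Head node only: one running node, no workers yet.
print(workers_to_add(running_nodes=1, current_workers=0))
# Head plus one worker: the batch would be 2, but only one slot is left.
print(workers_to_add(running_nodes=2, current_workers=1))
```

<p>Under this model, the cluster from the configuration above reaches its limit
of two workers in two rounds, adding one worker each time.</p>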
<h4 id="set-up-the-docker-image-for-the-head-and-worker-nodes" style="position:relative;">Set up the Docker image for the head and worker nodes<a href="#set-up-the-docker-image-for-the-head-and-worker-nodes" aria-label="set up the docker image for the head and worker nodes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h4>
<p>Using Docker enables you to run your distributed applications in a consistent
and controlled environment, leveraging Docker's containerization to manage
dependencies and system settings across all nodes seamlessly.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">docker</span><span class="token punctuation">:</span>
  <span class="token key atrule">image</span><span class="token punctuation">:</span> <span class="token string">'rayproject/ray-ml@sha256:fa8c69ae055b92bf2f97e22c6a96ea835be60afa69c224d6e1275c3040833d0a'</span>
  <span class="token key atrule">container_name</span><span class="token punctuation">:</span> <span class="token string">'ray_container'</span>
  <span class="token key atrule">pull_before_run</span><span class="token punctuation">:</span> <span class="token boolean important">True</span>
  <span class="token key atrule">run_options</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>ulimit nofile=65536<span class="token punctuation">:</span><span class="token number">65536</span></code></pre></div>
<p>This Ray cluster configuration segment specifies Docker settings for running
tasks across all nodes:</p>
<ul>
<li><code>image</code>: The Docker image used for containers on all nodes, pinned by its
SHA256 digest so that every node runs exactly the same image.</li>
<li><code>container_name</code>: The name for Docker containers, set to <code>ray_container</code>.</li>
</ul>
<h4 id="cloud-provider-configuration" style="position:relative;">Cloud-provider configuration<a href="#cloud-provider-configuration" aria-label="cloud provider configuration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h4>
<p>This part of the Ray cluster configuration defines the AWS provider settings
and the instance configuration for the head and worker nodes. The head node is
deliberately given zero CPU resources so that it runs no compute tasks.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">provider</span><span class="token punctuation">:</span>
  <span class="token key atrule">type</span><span class="token punctuation">:</span> aws
  <span class="token key atrule">region</span><span class="token punctuation">:</span> us<span class="token punctuation">-</span>west<span class="token punctuation">-</span><span class="token number">2</span>
  <span class="token key atrule">availability_zone</span><span class="token punctuation">:</span> us<span class="token punctuation">-</span>west<span class="token punctuation">-</span>2a<span class="token punctuation">,</span>us<span class="token punctuation">-</span>west<span class="token punctuation">-</span>2b
  <span class="token key atrule">cache_stopped_nodes</span><span class="token punctuation">:</span> <span class="token boolean important">True</span>
<span class="token key atrule">available_node_types</span><span class="token punctuation">:</span>
  <span class="token key atrule">ray.head.default</span><span class="token punctuation">:</span>
    <span class="token key atrule">resources</span><span class="token punctuation">:</span> <span class="token punctuation">{</span> <span class="token key atrule">'CPU'</span><span class="token punctuation">:</span> <span class="token number">0</span> <span class="token punctuation">}</span>
    <span class="token key atrule">node_config</span><span class="token punctuation">:</span>
      <span class="token key atrule">InstanceType</span><span class="token punctuation">:</span> m5.2xlarge
      <span class="token key atrule">BlockDeviceMappings</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> <span class="token key atrule">DeviceName</span><span class="token punctuation">:</span> /dev/sda1
          <span class="token key atrule">Ebs</span><span class="token punctuation">:</span>
            <span class="token key atrule">VolumeSize</span><span class="token punctuation">:</span> <span class="token number">160</span>
            <span class="token key atrule">VolumeType</span><span class="token punctuation">:</span> gp3
  <span class="token key atrule">ray.worker.default</span><span class="token punctuation">:</span>
    <span class="token key atrule">min_workers</span><span class="token punctuation">:</span> <span class="token number">1</span>
    <span class="token key atrule">max_workers</span><span class="token punctuation">:</span> <span class="token number">2</span>
    <span class="token key atrule">resources</span><span class="token punctuation">:</span> <span class="token punctuation">{</span><span class="token punctuation">}</span>
    <span class="token key atrule">node_config</span><span class="token punctuation">:</span>
      <span class="token key atrule">InstanceType</span><span class="token punctuation">:</span> m5.2xlarge
      <span class="token key atrule">InstanceMarketOptions</span><span class="token punctuation">:</span>
        <span class="token key atrule">MarketType</span><span class="token punctuation">:</span> spot
      <span class="token key atrule">BlockDeviceMappings</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> <span class="token key atrule">DeviceName</span><span class="token punctuation">:</span> /dev/sda1
          <span class="token key atrule">Ebs</span><span class="token punctuation">:</span>
            <span class="token key atrule">VolumeSize</span><span class="token punctuation">:</span> <span class="token number">160</span>
            <span class="token key atrule">VolumeType</span><span class="token punctuation">:</span> gp3</code></pre></div>
<p>This configuration establishes a cost-efficient Ray cluster on AWS, with an
on-demand head node and spot-instance worker nodes:</p>
<ul>
<li><strong>Head Node</strong> (<code>ray.head.default</code>): Configured to use an <code>m5.2xlarge</code> instance
with a custom block device mapping for a larger EBS volume (160 GB, gp3
type). The <code>resources</code> for the head node are set to <code>{"CPU": 0}</code>,
so it is not used for computation-intensive tasks and focuses
instead on cluster management and coordination.</li>
<li><strong>Worker Nodes</strong> (<code>ray.worker.default</code>): Also use <code>m5.2xlarge</code> instances
with the same storage configuration. Worker nodes run on spot
instances to reduce costs. Because their <code>resources</code> mapping is empty, CPU
and GPU resources are auto-detected and can be allocated for computational
tasks. The cluster scales dynamically between 1 and 2 worker nodes.</li>
<li>Setting <code>{"CPU": 0}</code> for the head node is a deliberate choice to ensure it does
not run compute-intensive tasks. The head node manages the cluster's
operations, including task scheduling and resource allocation.</li>
</ul>
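<p>The role of each node type can be read directly from the configuration. The
sketch below is illustrative only: it copies the relevant keys from the YAML
above into a plain Python dictionary, and <code>summarize_node_types</code> is a
hypothetical helper, not part of Ray. It assumes, as described above, that a
CPU resource of 0 keeps CPU tasks off a node and that an empty <code>resources</code>
mapping means auto-detection.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># Illustrative sketch: the dictionary mirrors the YAML above; the helper name
# and the returned fields are assumptions made for this example.
NODE_TYPES = {
    "ray.head.default": {
        "resources": {"CPU": 0},
        "node_config": {"InstanceType": "m5.2xlarge"},
    },
    "ray.worker.default": {
        "min_workers": 1,
        "max_workers": 2,
        "resources": {},
        "node_config": {
            "InstanceType": "m5.2xlarge",
            "InstanceMarketOptions": {"MarketType": "spot"},
        },
    },
}


def summarize_node_types(node_types):
    """Return, per node type, whether it takes CPU tasks and whether it is spot."""
    summary = {}
    for name, spec in node_types.items():
        resources = spec.get("resources", {})
        node_config = spec.get("node_config", {})
        market = node_config.get("InstanceMarketOptions", {}).get("MarketType")
        summary[name] = {
            # An explicit CPU count of 0 means no CPU-requiring task fits here.
            "schedules_cpu_tasks": resources.get("CPU") != 0,
            "spot": market == "spot",
            "instance_type": node_config.get("InstanceType"),
        }
    return summary


if __name__ == "__main__":
    for name, info in summarize_node_types(NODE_TYPES).items():
        print(name, info)</code></pre></div>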
<h4 id="files-or-directories-to-copy-to-the-head-and-worker-nodes" style="position:relative;">Files or directories to copy to the head and worker nodes<a href="#files-or-directories-to-copy-to-the-head-and-worker-nodes" aria-label="files or directories to copy to the head and worker nodes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h4>
<p>The <code>file_mounts</code> configuration facilitates the replication of a consistent
working environment across the cluster by ensuring all nodes have the necessary
code, configurations, and credentials. This setup supports seamless distributed
execution of tasks, including data processing, training machine learning models,
and interacting with cloud services.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">file_mounts</span><span class="token punctuation">:</span>
  <span class="token punctuation">{</span>
    <span class="token key atrule">'/home/ray/tutorial-mnist-dvc-ray'</span><span class="token punctuation">:</span> <span class="token string">'.'</span><span class="token punctuation">,</span>
    <span class="token key atrule">'/home/ray/tutorial-mnist-dvc-ray/.dvc/config.local'</span><span class="token punctuation">:</span> <span class="token string">'./.dvc/config.local'</span><span class="token punctuation">,</span>
    <span class="token key atrule">'/home/ray/.aws/credentials'</span><span class="token punctuation">:</span> <span class="token string">'~/.aws/ray-credentials'</span>
  <span class="token punctuation">}</span>
<span class="token key atrule">rsync_filter</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token string">'.gitignore'</span></code></pre></div>
<ul>
<li><code>/home/ray/tutorial-mnist-dvc-ray</code>: This entry maps the current local
directory (denoted by <code>"."</code>) to the remote directory
<code>/home/ray/tutorial-mnist-dvc-ray</code> on both the head and worker nodes. It
transfers the entire project (including the <code>.git</code> directory): code, scripts,
and any small data or configuration files needed to execute the pipeline.</li>
<li><code>/home/ray/tutorial-mnist-dvc-ray/.dvc/config.local</code>: This entry copies
the local DVC configuration file, <code>.dvc/config.local</code>, to the corresponding
path on the remote nodes. The file contains an access token for DVC Studio
and is therefore excluded from Git tracking as a security measure. Because
the <code>rsync_filter</code> patterns omit all Git-ignored files, including data files
and the DVC cache, the <code>config.local</code> file must be listed explicitly so that
it is transferred despite the filter. This keeps DVC Studio accessible from
all nodes in the cluster.</li>
<li><code>/home/ray/.aws/credentials</code>: This maps a custom AWS credentials file from the
local machine (<code>~/.aws/ray-credentials</code>) to the standard AWS credentials path
(<code>/home/ray/.aws/credentials</code>) on the remote nodes. This setup is essential
for enabling AWS SDKs and CLI tools running on the remote nodes to
authenticate with AWS services using the provided credentials.</li>
</ul>
<blockquote>
<p>💡 Note: This example uses a simplified approach to configure access to AWS
resources and DVC Studio. For a production setup, it's crucial to:</p>
<ul>
<li>Ensure that sensitive information, especially credentials, is handled
securely. Use IAM roles for EC2 instances where possible to avoid copying
AWS credentials.</li>
<li>Minimize the size of transferred directories to speed up the cluster
initialization process. Consider excluding large datasets or output
directories if they're not needed on every node or can be accessed from a
shared storage service like Amazon S3.</li>
</ul>
</blockquote>
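<p>One way to act on the second recommendation is to look at what the project
directory contains before launching the cluster. The sketch below is not part
of the tutorial repository: <code>directory_sizes</code> is a hypothetical helper that
totals the bytes under each top-level entry of a directory. It does not apply
<code>.gitignore</code> rules, so it overestimates what the <code>rsync_filter</code> above would
actually transfer.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import os


def directory_sizes(root, skip=()):
    """Return {top-level entry name: size in bytes} for the entries in root."""
    sizes = {}
    for entry in os.scandir(root):
        if entry.name in skip:
            continue
        if entry.is_file(follow_symlinks=False):
            sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
        elif entry.is_dir(follow_symlinks=False):
            total = 0
            # Sum every regular file below this top-level directory.
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for filename in filenames:
                    path = os.path.join(dirpath, filename)
                    if not os.path.islink(path):
                        total += os.path.getsize(path)
            sizes[entry.name] = total
    return sizes


if __name__ == "__main__":
    report = directory_sizes(".")
    # Largest entries first: these are the candidates for exclusion.
    for name, size in sorted(report.items(), key=lambda item: item[1], reverse=True):
        print(f"{size / 1e6:10.1f} MB  {name}")</code></pre></div>
<p>Run from the project root, this lists the largest entries first, which are
the natural candidates to exclude or move to shared storage.</p>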
<h4 id="additional-commands-to-set-up-nodes" style="position:relative;">Additional commands to set up nodes<a href="#additional-commands-to-set-up-nodes" aria-label="additional commands to set up nodes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h4>
<p>The <code>setup_commands</code> section in the Ray cluster configuration lists the
shell commands executed on all nodes (both head and worker nodes) during
their initialization phase. These commands install the software and libraries
your application needs.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">setup_commands</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> pip install <span class="token punctuation">-</span>U ray<span class="token punctuation">[</span>default<span class="token punctuation">]</span>
  <span class="token punctuation">-</span> pip install dvc<span class="token punctuation">[</span>s3<span class="token punctuation">]</span>==3.43.1 dvclive==3.41.1
  <span class="token punctuation">-</span> pip install <span class="token punctuation">-</span>U pyOpenSSL==24.0.0</code></pre></div>
<p>Here’s a breakdown:</p>
<ul>
<li><code>pip install -U ray[default]</code>: Upgrades Ray and installs its <code>default</code>
extras, which include the dashboard and the cluster launcher.</li>
<li><code>pip install dvc[s3]==3.43.1 dvclive==3.41.1</code>: Installs specific versions of
DVC (Data Version Control) with S3 support and DVCLive. Specifying versions
ensures consistency in running the tutorial example.</li>
<li><code>pip install -U pyOpenSSL==24.0.0</code>: Pins the pyOpenSSL library to a specific
version after the DVC installation. This example requires it to keep the
Python dependencies consistent.</li>
</ul>
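<p>After the nodes are provisioned, it can be worth confirming that the pinned
versions are the ones actually installed. The following is a minimal sketch
using only the standard library; <code>PINS</code> and <code>find_mismatches</code> are names
introduced for this example, and the pins are copied from the
<code>setup_commands</code> above.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">from importlib import metadata

# Pins copied from the setup_commands above.
PINS = {"dvc": "3.43.1", "dvclive": "3.41.1", "pyOpenSSL": "24.0.0"}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = None  # the package is not installed at all
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(PINS)
    for package, (expected, found) in problems.items():
        print(f"{package}: expected {expected}, found {found}")
    if not problems:
        print("All pinned packages match.")</code></pre></div>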
<h3 id="3---start-a-ray-cluster-on-aws" style="position:relative;">3 - Start a Ray Cluster on AWS<a href="#3---start-a-ray-cluster-on-aws" aria-label="3 start a ray cluster on aws permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Run the following command to start your Ray cluster as defined in your
<code>cluster.yaml</code> file:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">ray up cluster.yaml</code></pre></div>
<p>You can access the Ray dashboard once your Ray cluster is running. This
dashboard provides a real-time view of your cluster's status, including resource
utilization, task progress, and logs.</p>
<p>To open the Ray dashboard, use:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">ray dashboard cluster.yaml</code></pre></div>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ef0f001747c181994c94bef89591b2de/39600/4-dashboard.png" alt="Ray Dashboard" title="Ray Dashboard" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Ray
Dashboard</em></p>
<h3 id="4---connect-to-the-head-node-and-set-up-credentials" style="position:relative;">4 - Connect to the Head Node and Set Up Credentials<a href="#4---connect-to-the-head-node-and-set-up-credentials" aria-label="4 connect to the head node and set up credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Once your Ray cluster is provisioned and all nodes are correctly set up with the
necessary software, the next step involves connecting to the head node to
configure access credentials for GitHub, Amazon S3, and other services like DVC
Studio. These credentials are essential for version control, data storage, and
continuous integration and deployment (CI/CD) processes.</p>
<h4 id="connecting-to-the-cluster" style="position:relative;">Connecting to the Cluster<a href="#connecting-to-the-cluster" aria-label="connecting to the cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h4>
<p>To initiate a secure connection to the head node of your Ray cluster, use the
following command. This command utilizes the cluster configuration defined in
<code>cluster.yaml</code>, providing you with a terminal session on the head node:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token comment"># Connect to cluster</span>
ray attach cluster.yaml</code></pre></div>
<h4 id="setting-up-git-credentials" style="position:relative;">Setting Up Git Credentials<a href="#setting-up-git-credentials" aria-label="setting up git credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h4>
<p>Once connected to the head node, configure Git with your username and email to
enable commits to your repositories. You can also set up a GitHub access token
to push and pull securely without using a password. Replace
<code><your_username></code> with your GitHub username, <code><your_email></code> with the email
associated with your GitHub account, and <code><your_github_pat></code> with your GitHub
Personal Access Token (PAT).</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">git</span> config <span class="token parameter variable">--global</span> user.name <span class="token string">"<your_username>"</span>
<span class="token function">git</span> config <span class="token parameter variable">--global</span> user.email <span class="token string">"<your_email>"</span>
<span class="token builtin class-name">export</span> <span class="token assign-left variable">GITHUB_ACCESS_TOKEN</span><span class="token operator">=</span><span class="token operator"><</span>your_github_pat<span class="token operator">></span></code></pre></div>
<p>Use the access token to update the repository's remote URL for authentication.
This step assumes you have cloned the repository and are inside the repository
directory.</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">git</span> remote set-url origin https://your_username:<span class="token variable">${GITHUB_ACCESS_TOKEN}</span>@github.com/your_username/tutorial-mnist-dvc-ray.git</code></pre></div>
<h4 id="run-tests-to-check-the-correct-setup" style="position:relative;">Run tests to check the correct setup<a href="#run-tests-to-check-the-correct-setup" aria-label="run tests to check the correct setup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h4>
<p>Run the test script to confirm that AWS credentials are set up correctly on
the cluster for accessing S3.</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token builtin class-name">export</span> <span class="token assign-left variable">PYTHONPATH</span><span class="token operator">=</span><span class="token environment constant">$PWD</span>
python src/test_scripts/test_s3.py</code></pre></div>
<blockquote>
<p>The example scripts are inside the <code>~/tutorial-mnist-dvc-ray</code> directory.</p>
</blockquote>
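<p>The contents of <code>src/test_scripts/test_s3.py</code> are not reproduced here. As a
rough idea of what such a check can look like, the sketch below tries to list
a single object in a bucket. It is an assumption, not the tutorial's script:
it presumes <code>boto3</code> is installed on the node, <code>check_s3_access</code> is a
hypothetical helper, and <code>my-dvc-bucket</code> is a placeholder bucket name.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">def check_s3_access(bucket, client=None):
    """Return (ok, detail) after trying to list at most one object in bucket."""
    try:
        if client is None:
            import boto3  # assumption: boto3 is available on the node

            client = boto3.client("s3")
        response = client.list_objects_v2(Bucket=bucket, MaxKeys=1)
    except Exception as exc:  # e.g. missing credentials or access denied
        return False, f"{type(exc).__name__}: {exc}"
    return True, f"listed {response.get('KeyCount', 0)} object(s)"


if __name__ == "__main__":
    # "my-dvc-bucket" is a placeholder; use the bucket behind your DVC remote.
    ok, detail = check_s3_access("my-dvc-bucket")
    print("S3 access OK" if ok else "S3 access FAILED", "-", detail)</code></pre></div>
<p>Passing the client as a parameter keeps the check easy to exercise without
network access, for example with a small stand-in object.</p>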
<h3 id="5---run-dvc-pipelines-on-the-remote-ray-cluster" style="position:relative;">5 - Run DVC Pipelines on the remote Ray Cluster<a href="#5---run-dvc-pipelines-on-the-remote-ray-cluster" aria-label="5 run dvc pipelines on the remote ray cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Navigate to the <code>tutorial-mnist-dvc-ray</code> directory and run a new experiment:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token builtin class-name">export</span> <span class="token assign-left variable">PYTHONPATH</span><span class="token operator">=</span><span class="token environment constant">$PWD</span>
dvc exp run <span class="token parameter variable">-f</span></code></pre></div>
<p>This will start the pipeline, running the <code>tune</code> and <code>train</code> stages as defined
in your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file, utilizing distributed computation with Ray.</p>
<p>You can follow live updates of metrics and plots in
<a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">DVC Studio</a>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/28d12948c82876a84b7ae984c2a59f6d/39600/5-dvc-studio.png" alt="Live Metrics Tracking with DVC Studio" title="Live Metrics Tracking with DVC Studio" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Live Metrics Tracking with DVC Studio</em></p>
<p>This setup with DVC and DVCLive offers a structured approach to monitoring model
performance through metrics tracking and visualization. It aids in understanding
the model's behavior over training, facilitating decisions on model adjustments
or improvements. Moreover, after the experiment is complete, you may change the
plot template, add new plots, or customize the existing ones to suit your
specific requirements if needed.</p>
<h3 id="6---commit--push-experiments" style="position:relative;">6 - Commit & push experiments<a href="#6---commit--push-experiments" aria-label="6 commit push experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Once you've completed an experiment and are ready to share or preserve the
results, DVC provides a seamless workflow to list, select, and commit the
outcomes of your experiments. Here’s how to manage and share your experiment
results using DVC and Git.</p>
<p>Use <a href="https://dvc.org/doc/command-reference/exp/show"><code>dvc exp show</code></a> to get an overview of all experiments, including their
metrics and parameters.</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token punctuation">(</span>base<span class="token punctuation">)</span> ray@ip-172-31-41-217:~/tutorial-mnist-dvc-ray$ dvc exp show
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────<span class="token operator">></span>
Experiment Created loss accuracy step tune.run_tune tune.epoch_size tune.test_size tune.results_dir<span class="token operator">></span>
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────<span class="token operator">></span>
workspace - <span class="token number">0.38723</span> <span class="token number">0.8602</span> <span class="token number">4</span> True <span class="token number">512</span> <span class="token number">256</span> results/tune <span class="token operator">></span>
cloud-remote 02:17 PM <span class="token number">0.3951</span> <span class="token number">0.8542</span> <span class="token number">4</span> True <span class="token number">512</span> <span class="token number">256</span> results/tune <span class="token operator">></span>
├── dbcdc38 <span class="token punctuation">[</span>broad-teas<span class="token punctuation">]</span> 06:22 AM <span class="token number">0.38723</span> <span class="token number">0.8602</span> <span class="token number">4</span> True <span class="token number">512</span> <span class="token number">256</span> results/tune <span class="token operator">></span>
└── 11e273e <span class="token punctuation">[</span>metal-sick<span class="token punctuation">]</span> 06:21 AM <span class="token number">0.3951</span> <span class="token number">0.8542</span> <span class="token number">4</span> True <span class="token number">512</span> <span class="token number">256</span> results/tune <span class="token operator">></span>
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────<span class="token operator">></span>
<span class="token punctuation">(</span>END<span class="token punctuation">)</span></code></pre></div>
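<p>With more than a handful of experiments, picking the best one by eye becomes
error-prone. The sketch below shows one way to select it programmatically from
CSV text. It is illustrative only: <code>best_experiment</code> is a hypothetical
helper, and <code>SAMPLE</code> is a simplified, hand-written table built from the
values shown above rather than real command output.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import csv
import io

# Simplified, hand-written sample using the values from the table above.
# Real exported output contains more columns.
SAMPLE = """Experiment,loss,accuracy
broad-teas,0.38723,0.8602
metal-sick,0.3951,0.8542
"""


def best_experiment(csv_text, metric="loss", lower_is_better=True):
    """Return the name of the experiment with the best value of metric."""
    rows = [
        row
        for row in csv.DictReader(io.StringIO(csv_text))
        if row.get("Experiment") and row.get(metric)
    ]
    if not rows:
        raise ValueError(f"no experiments with metric {metric!r}")
    pick = min if lower_is_better else max
    return pick(rows, key=lambda row: float(row[metric]))["Experiment"]


if __name__ == "__main__":
    print(best_experiment(SAMPLE))  # prints "broad-teas"</code></pre></div>
<p>On the cluster, the CSV text could come from <code>dvc exp show --csv</code>. Column
names in the real output may differ from this sample, so adjust <code>metric</code> and
the <code>Experiment</code> key accordingly.</p>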
<p>After identifying the successful experiment (e.g., <code>broad-teas</code>), you can use
DVC to create a new branch for this experiment, facilitating version control and
collaboration.</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">dvc exp branch broad-teas</code></pre></div>
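<p>Next, check out the newly created branch, push it to your remote Git
repository, and upload artifacts to the DVC remote storage. When no branch name
is given, <code>dvc exp branch</code> names the branch after the experiment with a
<code>-branch</code> suffix, here <code>broad-teas-branch</code>.</p>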
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">git</span> checkout broad-teas-branch
<span class="token function">git</span> push origin broad-teas-branch
dvc push</code></pre></div>
<h3 id="7---stop-cluster" style="position:relative;">7 - Stop Cluster<a href="#7---stop-cluster" aria-label="7 stop cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
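<p>Turn off the remote cluster when not in use to save money and reduce
environmental impact! Because <code>cache_stopped_nodes</code> is set to <code>True</code> in this
configuration, <code>ray down</code> stops the EC2 instances instead of terminating them,
so their EBS volumes remain and continue to incur storage charges.</p>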
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">ray down cluster.yaml</code></pre></div>
<h2 id="-summing-up-dvc--ray-integration" style="position:relative;">🎨 Summing Up: DVC + Ray Integration<a href="#-summing-up-dvc--ray-integration" aria-label=" summing up dvc ray integration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The DVC + Ray integration presents a comprehensive solution to the challenges of
running machine learning experiments at scale. By addressing specific issues
related to auto-scaling, execution optimization, live metrics tracking, and data
synchronization, this setup ensures that machine learning teams can focus on
innovation and experimentation backed by a robust, scalable, and efficient
infrastructure.</p>
<p>Integrating DVC with Ray combines the best of data management and distributed
computing for machine learning projects. Here's a simplified overview of what we
covered:</p>
<ol>
<li><strong>Setup Ray Cluster</strong>: Configured a Ray cluster to run on AWS, utilizing
Docker for consistent environments and specifying node types for resource
optimization.</li>
<li><strong>Node Provisioning</strong>: Automated the setup of head and worker nodes for a
scalable ML experiment environment.</li>
<li><strong>Artifact Sync</strong>: Ensured DVC pipeline artifacts were synchronized across
the cluster, keeping data and models consistent.</li>
<li><strong>Manage Experiments with DVC Studio</strong>: Demonstrated how to use DVC, DVCLive,
and DVC Studio for metrics tracking, artifacts versioning, and experiment
management.</li>
<li><strong>Commit and Share Results</strong>: Highlighted the process of committing
experiment results and pushing them to a repository for collaboration and
reproducibility.</li>
</ol>
<p><strong>Key Takeaways</strong>:</p>
<ul>
<li><strong>Scalability</strong>: Ray and AWS offer a flexible and scalable setup for ML
experiments.</li>
<li><strong>Reproducibility</strong>: DVC adds data version control, enhancing experiment
reproducibility.</li>
<li><strong>Automation</strong>: The integration shows how to automate the ML workflow, from
setup to experiment tracking.</li>
<li><strong>Collaboration</strong>: Using Git and DVC supports effective team collaboration on
ML projects.</li>
</ul>
<blockquote>
<p>💡 Did you find this tutorial interesting? Please leave your comments and
share your experience with DVC and Ray! Join us on
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a> 🙌</p>
</blockquote>
<h2 id="references" style="position:relative;">References<a href="#references" aria-label="references permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ul>
<li><a href="https://dvc.org/doc/studio/user-guide/experiments/explore-ml-experiments" target="_blank" rel="nofollow noopener noreferrer">DVC Studio: Explore ML Experiments</a></li>
<li><a href="https://docs.ray.io/en/latest/ray-overview/getting-started.html" target="_blank" rel="nofollow noopener noreferrer">Ray docs: Getting Started</a></li>
<li><a href="https://www.anyscale.com/blog/ray-common-production-challenges-for-generative-ai-infrastructure" target="_blank" rel="nofollow noopener noreferrer">How Ray solves common production challenges for Generative AI infrastructure</a></li>
<li><a href="https://medium.com/samsara-engineering/building-a-modern-machine-learning-platform-with-ray-eb0271f9cbcf" target="_blank" rel="nofollow noopener noreferrer">Building a Modern Machine Learning Platform with Ray</a></li>
</ul>https://dvc.org/blog/dvc-rayhttps://dvc.org/blog/dvc-rayTue, 12 Mar 2024 00:00:00 GMT<p>Training models at the scale of the Gemini or GPT-4 models requires advanced
tools that manage complexity while ensuring efficiency. This tutorial explores
how Data Version Control (DVC) can be a game-changer for ambitious projects. DVC
simplifies AI development by automating pipelines, managing versions, and
tracking experiments while embracing GitOps for reproducibility. It excels in
both local and cloud environments for traditional ML workflows. However, the
rise of Generative AI and complex deep learning projects demands scalable,
distributed training solutions.</p>
<p>This tutorial is divided into two parts. Part 1 sets the foundation for scalable
and efficient machine learning workflows by leveraging Ray’s distributed
computing capabilities and DVC’s data version control.</p>
<p>In <a href="https://dvc.ai/blog/dvc-ray-part-2" target="_blank" rel="nofollow noopener noreferrer">Part 2</a>, we extend the solution to a Ray
Cluster on AWS, demonstrating how to adapt the setup for cloud-based distributed
computing. This involves configuring AWS resources, deploying Ray clusters in
the cloud, and running DVC-managed pipelines at scale.</p>
<blockquote>
<p>This guide is tailored for ML Engineers and Team Leads in AI projects who aim
to speed up training, optimize resources, and ensure reproducibility across
distributed environments. I look forward to hearing your feedback and
suggestions for improvement! 🙌</p>
</blockquote>
<blockquote>
<p>We would like to express our gratitude to
<a href="https://www.linkedin.com/in/schuh/" target="_blank" rel="nofollow noopener noreferrer">Andreas Schuh</a> from
<a href="https://www.heartflow.com/" target="_blank" rel="nofollow noopener noreferrer">HeartFlow</a> for his contribution to this solution
and for providing ideas and feedback for the blog posts. 🤝</p>
</blockquote>
<details>
<summary>Table Of Contents</summary>
<ul>
<li><a href="#why-dvc-and-ray">Why DVC and Ray?</a></li>
<li><a href="#tutorial-scope">Tutorial Scope</a>
<ul>
<li><a href="#high-level-solution-design">High-level solution design</a></li>
<li><a href="#prerequisites">Prerequisites</a></li>
</ul>
</li>
<li><a href="#-installation">👩💻 Installation</a></li>
<li><a href="#-get-started-with-ray">⭐ Get Started with Ray</a>
<ul>
<li><a href="#1---overview-of-the-ray-framework">1 - Overview of the Ray Framework</a></li>
<li><a href="#2---start-a-ray-cluster">2 - Start a Ray Cluster</a></li>
<li><a href="#3---run-a-test-script-on-the-ray-cluster">3 - Run a test script on the Ray Cluster</a></li>
</ul>
</li>
<li><a href="#%EF%B8%8F-run-dvc-pipeline-on-a-ray-cluster">🏃♂️ Run DVC Pipeline on a Ray Cluster</a>
<ul>
<li><a href="#1---design-solution-for-dvc--ray">1 - Design Solution for DVC + Ray</a></li>
<li><a href="#2---create-a-dvc-pipeline">2 - Create a DVC pipeline</a>
<ul>
<li><a href="#tune-stage">Tune Stage</a></li>
<li><a href="#train-stage">Train Stage</a></li>
</ul>
</li>
<li><a href="#3---run-dvc-pipelines-on-ray-cluster">3 - Run DVC pipelines on Ray Cluster</a></li>
</ul>
</li>
<li><a href="#-discuss-the-solution-design">💬 Discuss the Solution Design</a>
<ul>
<li><a href="#%EF%B8%8F-use-dvc-to-run-scripts-calling-ray-api">☝️ Use DVC to run scripts calling Ray API</a></li>
<li><a href="#%EF%B8%8F-persist-dvc-stage-outputs-to-keep-them-available-for-downstream-stages-in-case-of-failure">☝️ Persist DVC stage outputs to keep them available for downstream stages in case of failure</a></li>
<li><a href="#%EF%B8%8F-use-dvclive-to-track-live-metrics-updates-with-dvc-studio-and-dvc-extension-for-vs-code">☝️ <strong>Use DVCLive to track live metrics updates with DVC Studio and DVC Extension for VS Code</strong></a></li>
<li><a href="#%EF%B8%8F-propagate-dvc-environment-variables-to-worker-nodes">☝️ Propagate DVC environment variables to Worker nodes</a></li>
<li><a href="#%EF%B8%8F-copy-the-modelpth-file-from-the-ray-trial-folder-to-the-dvc-project-repository">☝️ Copy the <code>model.pth</code> file from the Ray Trial folder to the DVC project repository</a></li>
</ul>
</li>
<li><a href="#-summing-up-dvc--ray-integration">🎨 Summing Up: DVC + Ray Integration</a>
<ul>
<li><a href="#key-takeaways">Key Takeaways</a></li>
<li><a href="#looking-ahead-to-part-2">Looking Ahead to Part 2</a></li>
</ul>
</li>
<li><a href="#references">References</a></li>
</ul>
</details>
<h2 id="why-dvc-and-ray" style="position:relative;">Why DVC and Ray?<a href="#why-dvc-and-ray" aria-label="why dvc and ray permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a> is an open-source tool that brings GitOps and
reproducibility to data management, ML experiments, and model development. It
connects versioned data sources and code with pipelines, tracks experiments, and
registers models — all based on GitOps principles.</p>
<p><a href="https://www.ray.io/" target="_blank" rel="nofollow noopener noreferrer">Ray</a> is an open-source unified computing framework that
makes scaling AI and Python workloads easy — from reinforcement learning to deep
learning to tuning and model serving. Ray makes it a breeze to scale your
compute-intensive tasks from a single machine to a massive cluster without
losing your mind.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d89dbe2bc88dcfeb76d8cde4662ce349/39600/2-dvc-ray-distributed-ml.png" alt="DVC + Ray for distributed ML" title="DVC + Ray for distributed ML" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>DVC and Ray make your ML projects more manageable and prepare them to tackle the
challenges of tomorrow’s AI-driven landscape. Let’s explore this dynamic duo and
unlock new potentials in your MLOps journey!</p>
<blockquote>
<p>💡 <strong>Want to learn more about DVC?</strong></p>
<p>Join our online course about DVC:
<a href="https://learn.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">Iterative Tools for Data Scientists &amp; Analysts course</a>!</p>
</blockquote>
<h2 id="tutorial-scope" style="position:relative;">Tutorial Scope<a href="#tutorial-scope" aria-label="tutorial scope permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This tutorial guides you through creating automated, scalable, and
distributed ML pipelines using DVC (Data Version Control) and Ray. We start by
configuring a Ray Cluster for local and cloud environments, then discuss
the challenges of running DVC in distributed environments, and finally run a few
examples that combine DVC and Ray. By the end of the tutorial, you will be able to
design, run, and manage ML pipelines that are distributed over multiple nodes and
tracked through version control.</p>
<p>For <strong>DVC users</strong>, this tutorial offers several advantages:</p>
<ul>
<li>Bring Distributed Computing Efficiency to DVC projects</li>
<li>Easy use of AWS Cloud for Development and Production workflows</li>
<li>Enable automated pipelines and data versioning in ML projects with Ray</li>
</ul>
<p>For <strong>Ray users</strong>, this tutorial aims to highlight the benefits of integrating
DVC:</p>
<ul>
<li>Enhance Model Training Reproducibility with DVC’s data versioning capabilities</li>
<li>Streamline ML Pipeline Management through DVC’s structured approach</li>
<li>Facilitate Efficient Collaboration among teams by leveraging DVC for shared
data and model management</li>
</ul>
<h3 id="high-level-solution-design" style="position:relative;">High-level solution design<a href="#high-level-solution-design" aria-label="high level solution design permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Let’s review the high-level design of our target solution with DVC and Ray.</p>
<ol>
<li>Users can manage Ray Cluster and run DVC pipelines from a “local”
environment.</li>
<li>Ray distributes workloads across multiple workers and can auto-scale cluster
nodes.</li>
<li>During the training, DVCLive logs live updates of metrics and parameters to
DVC Studio.</li>
<li>DVC uses S3 to sync state between the Worker and Head nodes.</li>
<li>DVC uses remote storage (AWS S3) to manage data and model artifacts.</li>
<li>Users commit the results of the experiment to Git and DVC Remote Storage.</li>
</ol>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/878eabd6bb9ecbcac34ceddabccf71f2/39600/3-solution-design.png" alt="Solution Design" title="Solution Design" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>High-Level Solution Design</em></p>
<h3 id="prerequisites" style="position:relative;">Prerequisites<a href="#prerequisites" aria-label="prerequisites permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We expect that you:</p>
<ul>
<li>Have some experience with Machine Learning or Data Engineering pipelines</li>
<li>Are familiar with DVC</li>
</ul>
<p>To follow this tutorial, you’ll need the following tools:</p>
<ul>
<li>Git</li>
<li>Python 3.11 or above</li>
<li>AWS CLI (if you want to run pipelines in AWS)</li>
</ul>
<h2 id="-installation" style="position:relative;">👩💻 Installation<a href="#-installation" aria-label=" installation permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Creating an ML pipeline that runs distributed tasks is a powerful way to manage
and scale your machine learning workflows. With DVC, we can efficiently
orchestrate our pipeline stages and handle experiment outputs.</p>
<p>To clone the example repository, you can follow these steps:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">git</span> clone https://github.com/iterative/tutorial-mnist-dvc-ray.git
<span class="token builtin class-name">cd</span> tutorial-mnist-dvc-ray</code></pre></div>
<p>Install Python dependencies:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">python3 <span class="token parameter variable">-m</span> venv .venv
<span class="token builtin class-name">source</span> .venv/bin/activate
pip <span class="token function">install</span> <span class="token parameter variable">-r</span> requirements.txt
<span class="token builtin class-name">export</span> <span class="token assign-left variable">PYTHONPATH</span><span class="token operator">=</span><span class="token environment constant">$PWD</span></code></pre></div>
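<p>As a quick sanity check after installation, a short script along these lines can confirm the interpreter version and that the key packages are importable. This is only a sketch and not part of the tutorial repository: the package names (<code>ray</code>, <code>dvc</code>, <code>dvclive</code>) are assumed from the tools this tutorial uses, and the actual contents of <code>requirements.txt</code> may differ.</p>

```python
import importlib.util
import sys

# Assumed package names; adjust to match requirements.txt in your checkout.
REQUIRED_PACKAGES = ["ray", "dvc", "dvclive"]


def python_ok(min_version=(3, 11)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= tuple(min_version)


def missing_packages(names):
    """Return the subset of top-level package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {version} meets the minimum: {python_ok()}")
    print(f"Missing packages: {missing_packages(REQUIRED_PACKAGES) or 'none'}")
```

<p>Run it inside the activated virtual environment; an empty "missing" list suggests the dependencies were installed into the environment you expect.</p>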
<h2 id="-get-started-with-ray" style="position:relative;">⭐ Get Started with Ray<a href="#-get-started-with-ray" aria-label=" get started with ray permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="1---overview-of-the-ray-framework" style="position:relative;">1 - Overview of the Ray Framework<a href="#1---overview-of-the-ray-framework" aria-label="1 overview of the ray framework permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://docs.ray.io/en/latest/ray-overview/index.html" target="_blank" rel="nofollow noopener noreferrer">Ray</a> is a framework for
scaling AI and Python applications. For AI and ML applications, Ray helps to
scale jobs without needing infrastructure expertise:</p>
<ul>
<li>Efficiently parallelize and distribute ML workloads across multiple nodes and
GPUs.</li>
<li>Leverage the ML ecosystem with native and extensible integrations.</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/8f7a888d791d71b3524d7c72335a5efd/39600/4-ray-stack.png" alt="Stack of Ray libraries - a unified toolkit for ML workloads" title="Stack of Ray libraries - a unified toolkit for ML workloads" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Stack of Ray libraries - A Unified Toolkit For ML Workloads
(<a href="https://docs.ray.io/en/latest/ray-overview/index.html" target="_blank" rel="nofollow noopener noreferrer">Ray Docs</a>)</em></p>
<p>In this tutorial, we work with Ray Clusters and Ray AI Libraries (Ray Tune and
Ray Train).
<a href="https://docs.ray.io/en/latest/cluster/getting-started.html" target="_blank" rel="nofollow noopener noreferrer">Ray Cluster</a> is a
set of
<a href="https://docs.ray.io/en/latest/cluster/key-concepts.html#cluster-worker-nodes" target="_blank" rel="nofollow noopener noreferrer">Worker nodes</a>
connected to a common Ray
<a href="https://docs.ray.io/en/latest/cluster/key-concepts.html#cluster-head-node" target="_blank" rel="nofollow noopener noreferrer">Head node</a>.</p>
<ul>
<li>The Head node serves as the central coordination point for the Ray cluster. It
manages the cluster’s metadata, maintains the cluster state, and handles task
scheduling and management.</li>
<li>Worker nodes are the computational workhorses of the Ray cluster. They are
responsible for executing tasks and running computations for applications.</li>
</ul>
<p><img src="https://docs.ray.io/en/latest/_images/ray-cluster.svg" alt="Two nodes Ray Cluster">
<em>A Ray cluster with two worker nodes. Each node runs Ray helper processes to
facilitate distributed scheduling and memory management. The head node runs
additional control processes (highlighted in blue). Source:
<a href="https://docs.ray.io/en/latest/cluster/key-concepts.html#head-node" target="_blank" rel="nofollow noopener noreferrer">Ray Docs</a></em></p>
<p>Ray clusters can be fixed-size or autoscale up and down according to the
resources requested by applications running on the cluster.</p>
<p><a href="https://docs.ray.io/en/latest/tune/index.html" target="_blank" rel="nofollow noopener noreferrer">Ray Tune</a> is a Python Library
that automates the hyperparameter tuning process across distributed resources.
By integrating Ray Tune into the experiment workflow, we can evaluate numerous
hyperparameter combinations in parallel, speeding up the search for optimal
model configurations.</p>
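<p>To make the idea of evaluating hyperparameter combinations in parallel concrete, the following sketch runs a small grid search with Python's standard library. It deliberately does not use Ray Tune's API: <code>concurrent.futures</code> stands in for Ray's distributed scheduler, and the objective function and search space are toy values invented for illustration.</p>

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical search space, chosen only for this illustration.
SEARCH_SPACE = {"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}


def objective(config):
    """Toy loss with its minimum at lr=0.01, batch_size=64 (stands in for training)."""
    return (config["lr"] - 0.01) ** 2 + ((config["batch_size"] - 64) / 1000) ** 2


def grid(space):
    """Expand a dict of value lists into a list of config dicts."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*space.values())]


def parallel_grid_search(space, max_workers=4):
    """Evaluate every config concurrently and return (best_config, best_loss)."""
    configs = grid(space)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        losses = list(pool.map(objective, configs))
    best_loss, best_config = min(zip(losses, configs), key=lambda pair: pair[0])
    return best_config, best_loss


if __name__ == "__main__":
    best_config, best_loss = parallel_grid_search(SEARCH_SPACE)
    print(best_config, best_loss)
```

<p>With Ray Tune, each configuration would instead run as a trial scheduled on the cluster, so the same pattern can scale beyond a single machine.</p>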
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/cf01238905df716a8a7f3855be9739f3/39600/5-ray-tune.png" alt="Distributed tuning with Ray" title="Distributed tuning with Ray" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Distributed tuning with distributed training per trial. Source:
<a href="https://docs.ray.io/en/latest/ray-overview/use-cases.html" target="_blank" rel="nofollow noopener noreferrer">Ray Docs</a></em></p>
<p><a href="https://docs.ray.io/en/latest/train/train.html" target="_blank" rel="nofollow noopener noreferrer">Ray Train</a> creates a setup to
scale model training code from a single machine to a cluster of machines in the
cloud and abstracts away the complexities of distributed computing. At a high
level of abstraction, it distributes and runs training jobs among worker nodes.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9f2b1034200a057d471da25748e93e17/39600/6-ray-train-overview.png" alt="Ray Train Overview" title="Ray Train Overview" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Ray Train Overview. Source:
<a href="https://docs.ray.io/en/latest/train/overview.html" target="_blank" rel="nofollow noopener noreferrer">Ray Docs</a></em></p>
<h3 id="2---start-a-ray-cluster" style="position:relative;">2 - Start a Ray Cluster<a href="#2---start-a-ray-cluster" aria-label="2 start a ray cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<blockquote>
<p>💡 Navigate to the <code>main</code> branch in the repository</p>
</blockquote>
<p>To start a Ray Cluster, first initiate the Ray head node. The head node is the
primary node in the Ray cluster that manages the worker nodes. Since this is a
local setup, your machine will act as both the Head and Worker nodes. Use the
following command:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">ray start <span class="token parameter variable">--head</span></code></pre></div>
<p>This command starts the Ray cluster with your machine acting as the head node.</p>
<p>To monitor and debug Ray, view the dashboard at
<a href="http://127.0.0.1:8265/" target="_blank" rel="nofollow noopener noreferrer">http://127.0.0.1:8265/</a>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/dd970b852f021bb7250e93c79d760c37/39600/7-ray-dashboard.png" alt="Ray Dashboard" title="Ray Dashboard" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Ray
Dashboard - Cluster Nodes</em></p>
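<p>If the dashboard page does not load, it can help to check whether anything is listening on the dashboard port at all. The sketch below uses only the standard library; the host and port defaults are taken from the URL above, and the function name <code>port_is_open</code> is just an illustrative choice.</p>

```python
import socket


def port_is_open(host="127.0.0.1", port=8265, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    status = "reachable" if port_is_open() else "not reachable"
    print(f"Ray dashboard at http://127.0.0.1:8265 is {status}")
```

<p>A "not reachable" result usually means the head node is not running or the dashboard is bound to a different port.</p>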
<blockquote>
<p>💡 Multi-node Ray clusters are only supported on Linux. You may deploy Windows
and OSX clusters for development by setting the environment
variable <code>RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1</code>. Source:
<a href="https://docs.ray.io/en/latest/cluster/getting-started.html" target="_blank" rel="nofollow noopener noreferrer">Ray Clusters Overview</a>.</p>
</blockquote>
<h3 id="3---run-a-test-script-on-the-ray-cluster" style="position:relative;">3 - Run a test script on the Ray Cluster<a href="#3---run-a-test-script-on-the-ray-cluster" aria-label="3 run a test script on the ray cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You can run a simple test script to ensure your local Ray cluster works
correctly. In your project directory, create a file named <strong><code>hello_cluster.py</code></strong>
inside the <strong><code>src/test_scripts</code></strong> directory. Add a script to connect to the Ray
cluster and print a message. Here’s an example script:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> ray


<span class="token decorator annotation punctuation">@ray<span class="token punctuation">.</span>remote</span>
<span class="token keyword">def</span> <span class="token function">hello_world</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">return</span> <span class="token string">"Hello Ray cluster"</span>


<span class="token comment"># Automatically connect to the running Ray cluster.</span>
ray<span class="token punctuation">.</span>init<span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>ray<span class="token punctuation">.</span>get<span class="token punctuation">(</span>hello_world<span class="token punctuation">.</span>remote<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre></div>
<p>Execute the script using Python. Open your terminal and run:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">python src/test_scripts/hello_cluster.py</code></pre></div>
<p>You should see an output similar to this:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token number">2023</span>-11-14 <span class="token number">12</span>:11:17,363 INFO worker.py:1489 -- Connecting to existing Ray cluster at address: <span class="token number">192.168</span>.100.19:6379<span class="token punctuation">..</span>.
<span class="token number">2023</span>-11-14 <span class="token number">12</span>:11:17,370 INFO worker.py:1664 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
Hello Ray cluster</code></pre></div>
<p>This output indicates that your script has successfully connected to the local
Ray cluster, executed the remote <code>hello_world</code> task, and printed its result.</p>
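<p>A natural follow-up check is to fan several tasks out at once and gather their results. The sketch below assumes Ray is installed and a cluster was started with <code>ray start --head</code>; the names <code>square</code> and <code>run_on_ray</code> are illustrative and do not come from the tutorial repository.</p>

```python
def square(x):
    """Plain Python function; Ray wraps it into a remote task below."""
    return x * x


def run_on_ray(values):
    """Submit one remote task per value and gather the results in submission order."""
    import ray  # imported lazily so this file can be loaded without Ray installed

    # Connect to the running cluster; tolerate being called more than once.
    ray.init(ignore_reinit_error=True)
    remote_square = ray.remote(square)
    futures = [remote_square.remote(v) for v in values]
    return ray.get(futures)
```

<p>Calling <code>run_on_ray(range(5))</code> from a Python session should return the squares in the order the tasks were submitted, and the tasks should show up in the Ray dashboard while they run.</p>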
<h2 id="️-run-dvc-pipeline-on-a-ray-cluster" style="position:relative;">🏃♂️ Run DVC Pipeline on a Ray Cluster<a href="#%EF%B8%8F-run-dvc-pipeline-on-a-ray-cluster" aria-label="️ run dvc pipeline on a ray cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>At this point, you have a single-node Ray Cluster running on your local machine. Let’s
start with the DVC pipeline setup.</p>
<p>Goals for this section:</p>
<ul>
<li>Design a Solution for DVC + Ray.</li>
<li>Create a DVC pipeline with two stages: tune and train.</li>
<li>Modify DVCLive to sync metrics and parameters with DVC Studio.</li>
</ul>
<h3 id="1---design-solution-for-dvc--ray" style="position:relative;">1 - Design Solution for DVC + Ray<a href="#1---design-solution-for-dvc--ray" aria-label="1 design solution for dvc ray permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The technical design calls for a structure where ML experiment scripts, managed
by DVC, invoke Ray for their computation needs. DVC is the orchestrator,
invoking the appropriate Ray functions for distributed processing.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/6171b601ceccb690ed7d41ba04186227/39600/8-solution-design-local.png" alt="Design POC Solution for DVC + Ray (local)" title="Design POC Solution for DVC + Ray (local)" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Design POC Solution for DVC + Ray (local)</em></p>
<p>This diagram outlines the integration of DVC (Data Version Control) with a Ray
cluster for running ML experiments in a distributed manner:</p>
<ol>
<li>DVC initiates the process by running a stage script. The <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> pipeline
definition is the blueprint for the ML workflow, defining stages that utilize
Ray for hyperparameter tuning and subsequent training stages.</li>
<li>Ray Job Submission: The stage script (e.g., <code>src/stages/tune.py</code>) starts a
Ray application that submits computation jobs to Ray. The
<code>src/stages/tune.py</code> script utilizes Ray Tune’s <code>Tuner</code> class to define and
run the hyperparameter tuning trials.</li>
<li>In this local setup, the Ray Cluster contains a single Head Node, where the
actual computation occurs. (Note: In a production cluster, Ray distributes the
jobs across multiple worker nodes.) Ray saves the results of each job (trial) to
a local directory on a worker node (outside the DVC project repo).</li>
<li>After all jobs complete, the stage script copies results from Ray’s trial
directories into the DVC project repo (if needed).</li>
<li>DVC manages the outputs of the pipeline, ensuring reproducibility and
traceability.</li>
</ol>
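<p>Step 4 above, moving results from a Ray trial directory into the DVC project, can be sketched with the standard library. This is an illustration under assumptions, not the repository's actual code: the function name, its arguments, and the default file name are hypothetical, and the tutorial's stage scripts may locate their files differently.</p>

```python
import shutil
from pathlib import Path


def copy_trial_artifact(trial_dir, dest_dir, filename="model.pth"):
    """Copy one artifact from a Ray trial directory into the DVC project tree.

    All three arguments are illustrative; real stage scripts may name and
    locate their files differently.
    """
    source = Path(trial_dir) / filename
    if not source.is_file():
        raise FileNotFoundError(f"{filename} not found in {trial_dir}")
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # copy2 copies the file content and also preserves timestamps.
    return Path(shutil.copy2(source, dest_dir / filename))
```

<p>Once the file sits inside the project at the path declared under <code>outs</code>, DVC can track it like any other stage output.</p>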
<p>The result is a robust framework for conducting and managing ML experiments that
are scalable, reproducible, and efficiently optimized. This framework not only
streamlines the experimentation process but also simplifies the transition of
models from development to production.</p>
<h3 id="2---create-a-dvc-pipeline" style="position:relative;">2 - Create a DVC pipeline<a href="#2---create-a-dvc-pipeline" aria-label="2 create a dvc pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In this tutorial, the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file contains only two stages in the ML
pipeline: <code>tune</code> and <code>train</code>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bf19a010eb28547e700c40d28eae4b1d/39600/9-dvc-pipeline.png" alt="DVC pipeline" title="DVC pipeline" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DVC
pipeline configuration in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> with <code>tune</code> and <code>train</code> stages, and <code>plots</code>
sections</em></p>
<h4 id="tune-stage" style="position:relative;">Tune Stage<a href="#tune-stage" aria-label="tune stage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h4>
<p>This initial stage is responsible for hyperparameter tuning. It uses Ray to
distribute the computation involved in this process. The stage executes a Python
script <code>tune.py</code> that optimizes hyperparameters using Ray Tune. The output
of this stage is <code>best_params.yaml</code>, which contains the best hyperparameters
found during the tuning process.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">tune</span><span class="token punctuation">:</span>
  <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python src/stages/tune.py <span class="token punctuation">-</span><span class="token punctuation">-</span>config params.yaml
  <span class="token key atrule">params</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> tune
  <span class="token key atrule">outs</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> $<span class="token punctuation">{</span>tune.results_dir<span class="token punctuation">}</span>/best_params.yaml<span class="token punctuation">:</span>
        <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span>
        <span class="token key atrule">persist</span><span class="token punctuation">:</span> <span class="token boolean important">true</span></code></pre></div>
<p>The <code>best_params.yaml</code> output uses two specific configuration parameters:</p>
<ul>
<li>Set <code>cache: false</code> to instruct DVC not to cache the file, so that it can be
versioned with Git instead.</li>
<li>Set <code>persist: true</code> to instruct DVC not to remove the file before reproducing
the stage. This is useful for downstream stages when you work in an unstable
environment (or while debugging), where the stage script can fail for any reason. In
this example, even if the <code>tune</code> stage fails, you can run the <code>train</code> stage
using <code>best_params.yaml</code> from the previous run.</li>
</ul>
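<p>The effect of <code>persist: true</code> on a downstream stage can be sketched as follows. To stay dependency-free, the example writes and reads a flat "key: value" file by hand, which covers only a tiny subset of YAML; real stage scripts would more likely use a YAML library. The function names and the parameter values are hypothetical.</p>

```python
from pathlib import Path


def write_flat_yaml(path, params):
    """Write a flat dict as 'key: value' lines (a tiny subset of YAML)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    lines = [f"{key}: {value}" for key, value in params.items()]
    path.write_text("\n".join(lines) + "\n")


def read_flat_yaml(path):
    """Read 'key: value' lines back into a dict of strings."""
    result = {}
    for line in Path(path).read_text().splitlines():
        if line.strip():
            key, _, value = line.partition(":")
            result[key.strip()] = value.strip()
    return result


def load_best_params(path, fallback):
    """Use the persisted file if it exists, otherwise fall back to defaults."""
    return read_flat_yaml(path) if Path(path).is_file() else dict(fallback)
```

<p>If a later <code>tune</code> run fails before rewriting the file, <code>load_best_params</code> still returns the values from the previous run, which is the situation <code>persist: true</code> is meant to support.</p>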
<h4 id="train-stage" style="position:relative;">Train Stage<a href="#train-stage" aria-label="train stage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h4>
<p>The Train Stage runs distributed computation via Ray. This stage depends on
<code>best_params.yaml</code>, generated by the <code>tune</code> stage, to access the optimal
hyperparameters for training the model. The <code>train</code> stage runs the
<code>train.py</code> script, which trains the model with the tuned parameters.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train</span><span class="token punctuation">:</span>
  <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python src/stages/train.py <span class="token punctuation">-</span><span class="token punctuation">-</span>config params.yaml
  <span class="token key atrule">params</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> train
  <span class="token key atrule">deps</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> $<span class="token punctuation">{</span>tune.results_dir<span class="token punctuation">}</span>/best_params.yaml
  <span class="token key atrule">outs</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> $<span class="token punctuation">{</span>train.results_dir<span class="token punctuation">}</span>/model.pth</code></pre></div>
<p>The trained model is saved as <code>model.pth</code>, with the path again parameterized to
allow flexibility in the output location. The output model is automatically
cached and versioned with DVC.</p>
<h3 id="3---run-dvc-pipelines-on-ray-cluster" style="position:relative;">3 - Run DVC pipelines on Ray Cluster<a href="#3---run-dvc-pipelines-on-ray-cluster" aria-label="3 run dvc pipelines on ray cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>To execute your automated and distributed ML pipeline with DVC, perform the
following steps:</p>
<ul>
<li>Set the <code>PYTHONPATH</code> environment variable so that Python scripts can import
modules from your project’s directory.</li>
<li>Run the DVC pipeline with the <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> command.</li>
</ul>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token builtin class-name">export</span> <span class="token assign-left variable">PYTHONPATH</span><span class="token operator">=</span><span class="token environment constant">$PWD</span>
dvc exp run</code></pre></div>
<p>This will start the pipeline, running the <code>tune</code> and <code>train</code> stages as defined
in your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file, utilizing distributed computation with Ray.</p>
<p>You can follow live updates of metrics and plots in
<a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">DVC Studio</a> and the
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC Extension for VS Code</a>.
DVC generates and renders plots from your project’s data, so the metrics and
plots logged with DVCLive can be visualized in both tools.</p>
<p>A few benefits of tracking and visualizing metrics and plots with DVC
(<a href="https://dvc.org/doc/user-guide/experiment-management/visualizing-plots" target="_blank" rel="nofollow noopener noreferrer">see docs</a>):</p>
<ul>
<li>Enhanced Experiment Tracking: Compare metrics, parameters, data versions,
and plots between experiments as they update live (docs:
<a href="https://dvc.org/doc/studio/user-guide/experiments/visualize-and-compare" target="_blank" rel="nofollow noopener noreferrer">Visualize and Compare experiments</a>).</li>
<li>Customize Visualization: Define a visualization template and interactively select the
data to be visualized and the titles, before or after the experiment is
complete (docs:
<a href="https://dvc.org/doc/user-guide/experiment-management/visualizing-plots#defining-plots" target="_blank" rel="nofollow noopener noreferrer">Defining plots</a>).</li>
<li>Share & Version Control for Metrics: You can
send <a href="https://dvc.org/doc/user-guide/experiment-management/sharing-experiments#live-metrics-and-plots" target="_blank" rel="nofollow noopener noreferrer">live metrics and plots</a> to <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">DVC Studio</a>, <a href="https://dvc.org/doc/user-guide/experiment-management/sharing-experiments#push-experiments" target="_blank" rel="nofollow noopener noreferrer">push</a> completed experiments (including
data, models, and code), and convert an experiment into
a <a href="https://dvc.org/doc/user-guide/experiment-management/sharing-experiments#persist-experiment" target="_blank" rel="nofollow noopener noreferrer">persistent</a> branch
or commit in your Git repo (docs:
<a href="https://dvc.org/doc/user-guide/experiment-management/sharing-experiments" target="_blank" rel="nofollow noopener noreferrer">Sharing Experiments</a>).</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/979a10864ed18812ec17855a2ec3b7b5/39600/11-experiment-tracking.png" alt="Experiment tracking" title="Experiment tracking" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Experiment tracking with <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">DVC Studio</a> and
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC Extension for VS Code</a></em></p>
<blockquote>
<p>💡 Note: Sometimes, when you run <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> with a local Ray Cluster, the
process may hang at the
<code>Connecting to existing Ray cluster at address: 192.168.100.19:6379...</code>
message due to a <code>ConnectionError</code> in Ray. In this case, open a new terminal
session, export <code>PYTHONPATH</code>, and run the <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> command there.</p>
</blockquote>
<h2 id="-discuss-the-solution-design" style="position:relative;">💬 Discuss the Solution Design<a href="#-discuss-the-solution-design" aria-label=" discuss the solution design permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The section above walks through a simple example of running DVC and Ray together.
It’s not a production setup, but it’s a good starting point for developing and debugging
a DVC pipeline with Ray.</p>
<p>Let’s review the decisions we made and discuss some details:</p>
<ol>
<li>Use DVC to run scripts calling Ray API.</li>
<li>Persist DVC stage outputs to keep them available for downstream stages in
case of failure.</li>
<li>Use DVCLive to track metrics only on a worker with a rank of 0.</li>
<li>Propagate DVC environment variables to worker nodes using Ray’s
<code>RuntimeEnv</code>.</li>
<li>Copy the <code>model.pth</code> file from the Ray Trial folder to the DVC project
repository.</li>
</ol>
<h3 id="️-use-dvc-to-run-scripts-calling-ray-api" style="position:relative;">☝️ Use DVC to run scripts calling Ray API<a href="#%EF%B8%8F-use-dvc-to-run-scripts-calling-ray-api" aria-label="️ use dvc to run scripts calling ray api permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The Ray framework provides a rich Python API for distributed data processing, model
tuning, and training. Wrapping Ray scripts into callable Python modules
makes them easy to run as DVC stages. As a result, you get two benefits:</p>
<ul>
<li>Get scalability and distributed training with Ray</li>
<li>Get reproducibility and versioning with DVC</li>
</ul>
<p>A template of the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> for DVC + Ray:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">first_stage</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python first_script_with_ray.py
    <span class="token punctuation">...</span>
  <span class="token key atrule">next_stage</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python second_script_with_ray.py
    <span class="token punctuation">...</span></code></pre></div>
<h3 id="️-persist-dvc-stage-outputs-to-keep-them-available-for-downstream-stages-in-case-of-failure" style="position:relative;">☝️ Persist DVC stage outputs to keep them available for downstream stages in case of failure<a href="#%EF%B8%8F-persist-dvc-stage-outputs-to-keep-them-available-for-downstream-stages-in-case-of-failure" aria-label="️ persist dvc stage outputs to keep them available for downstream stages in case of failure permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Set <code>persist: true</code> to instruct DVC not to remove the file before reproducing
the stage. This is useful for outputs that downstream stages depend on when you work in an
unstable environment (or while debugging), where the stage script might fail.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">first_stage</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python first_script_with_ray.py
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">stage_output.file</span><span class="token punctuation">:</span>
          <span class="token key atrule">persist</span><span class="token punctuation">:</span> <span class="token boolean important">true</span></code></pre></div>
<h3 id="️-use-dvclive-to-track-live-metrics-updates-with-dvc-studio-and-dvc-extension-for-vs-code" style="position:relative;">☝️ <strong>Use DVCLive to track live metrics updates with DVC Studio and <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC Extension for VS Code</a></strong><a href="#%EF%B8%8F-use-dvclive-to-track-live-metrics-updates-with-dvc-studio-and-dvc-extension-for-vs-code" aria-label="️ use dvclive to track live metrics updates with dvc studio and dvc extension for vs code permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Ray Train lets you use native experiment tracking libraries inside
the <a href="https://docs.ray.io/en/latest/train/overview.html#train-overview-training-function" target="_blank" rel="nofollow noopener noreferrer">train_func</a> function.
<a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a> is a highly flexible and lightweight
library that simplifies experiment tracking in DVC projects.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive <span class="token keyword">import</span> Live
<span class="token keyword">with</span> Live<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">as</span> live<span class="token punctuation">:</span>
    live<span class="token punctuation">.</span>log_metric<span class="token punctuation">(</span>metric_name<span class="token punctuation">,</span> value<span class="token punctuation">)</span></code></pre></div>
<p>This solution logs metrics with <code>Live()</code> inside
the <code>train_func_per_worker()</code> function.</p>
<p>One significant difference between distributed and non-distributed training
is that distributed training runs multiple processes in parallel, and under
some configurations these processes produce identical results. If
every process reports its results to the tracking backend, there’s a risk of
receiving duplicate entries (check
<a href="https://docs.ray.io/en/latest/train/user-guides/experiment-tracking.html" target="_blank" rel="nofollow noopener noreferrer">Ray docs</a>
for details).</p>
<p>Therefore, a few adjustments should be made to how DVCLive is used.</p>
<ol>
<li>Use DVCLive to track metrics only on a worker with a rank of 0.</li>
<li>Use the <code>DVC_ROOT</code> variable to build the path for the <a href="https://dvc.org/doc/dvclive/live/"><code>Live(dir=...)</code></a> object. DVC
automatically sets <code>DVC_ROOT</code> to the root directory of
your DVC repository, so the rank 0 worker writes metrics inside the repo
(<a href="https://dvc.org/doc/user-guide/env" target="_blank" rel="nofollow noopener noreferrer">docs</a>).</li>
</ol>
<p>As a result, the DVCLive usage code inside the <code>train_func_per_worker()</code>
function looks like the example below.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token comment"># train.py</span>
<span class="token keyword">import</span> os
<span class="token keyword">from</span> typing <span class="token keyword">import</span> Dict

<span class="token keyword">import</span> ray<span class="token punctuation">.</span>train
<span class="token keyword">from</span> dvclive <span class="token keyword">import</span> Live


<span class="token keyword">def</span> <span class="token function">train_func_per_worker</span><span class="token punctuation">(</span>config<span class="token punctuation">:</span> Dict<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Initialize DVC Live</span>
    live <span class="token operator">=</span> <span class="token boolean">None</span>
    rank <span class="token operator">=</span> ray<span class="token punctuation">.</span>train<span class="token punctuation">.</span>get_context<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>get_world_rank<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token comment"># Create a Live object on the rank 0 worker</span>
    <span class="token keyword">if</span> rank <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">:</span>
        live <span class="token operator">=</span> Live<span class="token punctuation">(</span>
            <span class="token builtin">dir</span><span class="token operator">=</span>os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>join<span class="token punctuation">(</span>os<span class="token punctuation">.</span>environ<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"DVC_ROOT"</span><span class="token punctuation">,</span><span class="token string">""</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">"results/dvclive"</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
        <span class="token punctuation">)</span>
    <span class="token keyword">for</span> epoch <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span>epochs<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token comment"># ...epoch training</span>
        <span class="token comment"># Log metrics with DVCLive (rank 0 worker only)</span>
        <span class="token keyword">if</span> live<span class="token punctuation">:</span>
            live<span class="token punctuation">.</span>log_metric<span class="token punctuation">(</span><span class="token string">"loss"</span><span class="token punctuation">,</span> test_loss<span class="token punctuation">)</span>
            live<span class="token punctuation">.</span>log_metric<span class="token punctuation">(</span><span class="token string">"accuracy"</span><span class="token punctuation">,</span> accuracy<span class="token punctuation">)</span>
            live<span class="token punctuation">.</span>next_step<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div>
<p>Logging metrics and plots with DVCLive in Python code automatically
generates the necessary plot configuration in the <code>dvc.yaml</code> file. Below
is an example configuration for metrics and plots:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">metrics</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> results/dvclive/metrics.json
<span class="token key atrule">plots</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token key atrule">accuracy</span><span class="token punctuation">:</span>
      <span class="token key atrule">x</span><span class="token punctuation">:</span> step
      <span class="token key atrule">y</span><span class="token punctuation">:</span>
        <span class="token key atrule">results/dvclive/plots/metrics/accuracy.tsv</span><span class="token punctuation">:</span> accuracy
      <span class="token key atrule">title</span><span class="token punctuation">:</span> Accuracy
      <span class="token key atrule">x_label</span><span class="token punctuation">:</span> Step
      <span class="token key atrule">y_label</span><span class="token punctuation">:</span> Accuracy
  <span class="token punctuation">-</span> <span class="token key atrule">loss</span><span class="token punctuation">:</span>
      <span class="token key atrule">template</span><span class="token punctuation">:</span> simple
      <span class="token key atrule">x</span><span class="token punctuation">:</span> step
      <span class="token key atrule">y</span><span class="token punctuation">:</span>
        <span class="token key atrule">results/dvclive/plots/metrics/loss.tsv</span><span class="token punctuation">:</span> loss
      <span class="token key atrule">title</span><span class="token punctuation">:</span> Loss
      <span class="token key atrule">x_label</span><span class="token punctuation">:</span> Step
      <span class="token key atrule">y_label</span><span class="token punctuation">:</span> Loss
  <span class="token punctuation">-</span> results/tune/plots/images</code></pre></div>
<p>The <code>train</code> stage logs metrics and plots to <code>results/dvclive</code>. Datapoints for
metrics and plots are saved in files and visualized later in DVC Studio and VS
Code.<br>
<span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/65e5240e1e47be1e405116f587cb2b85/39600/10-2-train-metrics.png" alt="Metrics and plot" title="Metrics and plot" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Metrics
plot generated by the <code>train</code> stage</em></p>
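<p>Because the datapoints are plain files, they can also be inspected directly. The
sketch below is a minimal illustration that assumes the metric file is tab-separated
with a <code>step</code> column and a column named after the metric, as the
<code>x</code> and <code>y</code> settings in the configuration above suggest. The helper name
<code>read_metric_history</code> is hypothetical and not part of DVC or DVCLive.</p>

```python
import csv
from pathlib import Path
from typing import List, Tuple


def read_metric_history(tsv_path: Path, metric: str) -> List[Tuple[int, float]]:
    """Return (step, value) pairs from a tab-separated metric file.

    Assumes a header row containing a "step" column and a column named
    after the metric (an assumption based on the plots config above).
    """
    with open(tsv_path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [(int(row["step"]), float(row[metric])) for row in reader]


if __name__ == "__main__":
    # Path taken from the plots configuration; it only exists after a training run.
    path = Path("results/dvclive/plots/metrics/accuracy.tsv")
    if path.exists():
        for step, value in read_metric_history(path, "accuracy"):
            print(step, value)
```

<p>A quick check like this can help confirm that only the rank 0 worker wrote
datapoints, since duplicated steps would show up as repeated rows.</p>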
<p>The tune stage logs a mean_accuracy_plot.png file to visualize metrics for
tuning trials.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 567px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ca2e4e8b125c2f57b537f0c090594b8d/0a7db/10-tune-metrics.png" alt="Metrics plot" title="Metrics plot" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Metrics plot generated by the <code>tune</code> stage</em></p>
<h3 id="️-propagate-dvc-environment-variables-to-worker-nodes" style="position:relative;">☝️ Propagate DVC environment variables to Worker nodes<a href="#%EF%B8%8F-propagate-dvc-environment-variables-to-worker-nodes" aria-label="️ propagate dvc environment variables to worker nodes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC environment variables are necessary for every Ray worker because they
provide essential information and configurations for DVCLive, facilitating
experiment tracking. These variables include:</p>
<ol>
<li><strong>DVC_STUDIO_REPO_URL</strong>: URL of the Git remote associated with the DVC Studio project.</li>
<li><strong>DVC_STUDIO_TOKEN</strong>: Authentication token for secure access to DVC Studio.</li>
<li><strong>DVC_STUDIO_URL</strong>: URL of the DVC Studio instance to use (for example, a self-hosted one).</li>
<li><strong>DVC_EXP_BASELINE_REV</strong>: Git revision of the baseline commit from which the experiment derives.</li>
<li><strong>DVC_EXP_NAME</strong>: Descriptive identifier for the experiment.</li>
<li><strong>DVC_ROOT</strong>: Root directory of the DVC project on the filesystem.</li>
</ol>
<blockquote>
<p>💡 Note: All environment variables above are set by DVC automatically when
running a pipeline.</p>
</blockquote>
<p>You don’t need to worry about DVC environment variables when running DVC in a
non-distributed environment. In a Ray Cluster, however, they must be set
on every worker. In this solution, DVC environment variables are passed via
<a href="https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html#ray.runtime_env.RuntimeEnv" target="_blank" rel="nofollow noopener noreferrer">RuntimeEnv</a>,
which specifies a runtime environment for the whole job.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2a442f0d89f6c564d6a87f715ffcee7b/39600/12-env-vars.png" alt="Set up Environment Variables" title="Set up Environment Variables" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Set up DVC Environment Variables</em></p>
<p>The code snippet below demonstrates an approach to managing DVC environment
variables within a TorchTrainer setup.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">train_func_per_worker</span><span class="token punctuation">(</span>config<span class="token punctuation">:</span> Dict<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment">#...</span>
    <span class="token keyword">if</span> rank <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">:</span>
        live <span class="token operator">=</span> Live<span class="token punctuation">(</span>
            <span class="token builtin">dir</span><span class="token operator">=</span>os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>join<span class="token punctuation">(</span>os<span class="token punctuation">.</span>environ<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"DVC_ROOT"</span><span class="token punctuation">,</span><span class="token string">""</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">"results/dvclive"</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
        <span class="token punctuation">)</span>

<span class="token keyword">def</span> <span class="token function">train</span><span class="token punctuation">(</span>params<span class="token punctuation">:</span> <span class="token builtin">dict</span><span class="token punctuation">)</span> <span class="token operator">-</span><span class="token operator">></span> <span class="token boolean">None</span><span class="token punctuation">:</span>
    <span class="token comment">#...</span>
    trainer <span class="token operator">=</span> TorchTrainer<span class="token punctuation">(</span>
        train_loop_per_worker<span class="token operator">=</span>train_func_per_worker<span class="token punctuation">,</span>
        train_loop_config<span class="token operator">=</span>train_config<span class="token punctuation">,</span>
        <span class="token comment"># ...</span>
    <span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">"__main__"</span><span class="token punctuation">:</span>
    <span class="token comment">#...</span>
    <span class="token comment"># [1] Propagate DVC environment variables from Head Node to Workers</span>
    <span class="token comment"># =============================================</span>
    DVC_ENV_VARS <span class="token operator">=</span> <span class="token punctuation">{</span>k<span class="token punctuation">:</span> v <span class="token keyword">for</span> k<span class="token punctuation">,</span> v <span class="token keyword">in</span> os<span class="token punctuation">.</span>environ<span class="token punctuation">.</span>items<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">if</span> k<span class="token punctuation">.</span>startswith<span class="token punctuation">(</span><span class="token string">"DVC"</span><span class="token punctuation">)</span><span class="token punctuation">}</span>
    ray<span class="token punctuation">.</span>init<span class="token punctuation">(</span>runtime_env<span class="token operator">=</span>RuntimeEnv<span class="token punctuation">(</span>env_vars<span class="token operator">=</span>DVC_ENV_VARS<span class="token punctuation">)</span><span class="token punctuation">)</span>
    train<span class="token punctuation">(</span>params<span class="token punctuation">)</span></code></pre></div>
<ul>
<li>To ensure that DVC environment variables are accessible within the training
loop across all worker nodes, <code>RuntimeEnv</code> propagates these variables from the
head node to the workers.</li>
</ul>
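<p>The filtering step in the snippet above can be exercised without a Ray cluster.
The following sketch uses only the standard library and is not part of the original
solution: <code>collect_dvc_env</code> and <code>missing_dvc_vars</code> are hypothetical helper
names, and the idea of checking for missing variables before training is an
assumption added for illustration.</p>

```python
import os
from typing import Dict, Iterable, List, Mapping


def collect_dvc_env(environ: Mapping[str, str]) -> Dict[str, str]:
    """Select variables whose names start with "DVC", as in the snippet above."""
    return {k: v for k, v in environ.items() if k.startswith("DVC")}


def missing_dvc_vars(environ: Mapping[str, str], required: Iterable[str]) -> List[str]:
    """Return the required variable names that are absent or empty."""
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    # On the head node: the dictionary that would be handed to RuntimeEnv(env_vars=...).
    env_vars = collect_dvc_env(os.environ)
    print(sorted(env_vars))
    # On a worker: report variables that did not arrive (empty outside `dvc exp run`).
    print(missing_dvc_vars(os.environ, ["DVC_ROOT", "DVC_EXP_NAME"]))
```

<p>Calling a check like <code>missing_dvc_vars</code> at the start of the training function
could surface a propagation problem early, instead of having DVCLive silently write
outside the repository.</p>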
<h3 id="️-copy-the-modelpth-file-from-the-ray-trial-folder-to-the-dvc-project-repository" style="position:relative;">☝️ Copy the <code>model.pth</code> file from the Ray Trial folder to the DVC project repository<a href="#%EF%B8%8F-copy-the-modelpth-file-from-the-ray-trial-folder-to-the-dvc-project-repository" aria-label="️ copy the modelpth file from the ray trial folder to the dvc project repository permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Upon completing the training process, the <code>model.pth</code> file is saved in the Ray
Trial folder. Therefore, it’s copied to the DVC project repository (as shown in
the code example above).</p>
<p>This ensures that the trained model file is appropriately stored within the
DVC-managed project structure, facilitating version control and reproducibility.</p>
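<p>As a rough illustration of the copy step, the sketch below uses only the standard
library. The function name <code>copy_model_to_repo</code> and both directory arguments are
hypothetical. In a real run, the source directory would be wherever Ray wrote the
trial’s files, and the destination would be the directory declared under
<code>outs</code> in the <code>train</code> stage.</p>

```python
import shutil
import tempfile
from pathlib import Path


def copy_model_to_repo(trial_dir: Path, repo_results_dir: Path,
                       filename: str = "model.pth") -> Path:
    """Copy a model file from a trial directory into the DVC repository."""
    source = trial_dir / filename
    if not source.is_file():
        raise FileNotFoundError(f"{source} not found")
    # Create the destination directory if this is the first run.
    repo_results_dir.mkdir(parents=True, exist_ok=True)
    destination = repo_results_dir / filename
    shutil.copy2(source, destination)
    return destination


if __name__ == "__main__":
    # Demo with temporary directories standing in for the trial folder and the repo.
    with tempfile.TemporaryDirectory() as tmp:
        trial_dir = Path(tmp) / "ray_trial"
        trial_dir.mkdir()
        (trial_dir / "model.pth").write_bytes(b"weights")
        print(copy_model_to_repo(trial_dir, Path(tmp) / "repo" / "results" / "train"))
```

<p>Failing loudly when the file is missing keeps the DVC stage from succeeding
without its declared output.</p>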
<h2 id="-summing-up-dvc--ray-integration" style="position:relative;">🎨 Summing Up: DVC + Ray Integration<a href="#-summing-up-dvc--ray-integration" aria-label=" summing up dvc ray integration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The DVC + Ray integration presents a comprehensive solution to the challenges of
running machine learning experiments at scale. By addressing specific issues
related to auto-scaling, execution optimization, live metrics tracking, and data
synchronization, this setup ensures that machine learning teams can focus on
innovation and experimentation backed by a robust, scalable, and efficient
infrastructure.</p>
<p>In Part 1 of the tutorial, we explored the basics of setting up and integrating
DVC with Ray for distributed machine learning workflows. We covered the
following key topics:</p>
<ul>
<li><strong>Introduction to Ray</strong>: We discussed Ray’s capabilities for scaling AI and
Python applications, focusing on its ability to parallelize and distribute ML
workloads across multiple nodes easily.</li>
<li><strong>Ray Clusters</strong>: The architecture of Ray clusters was explained, highlighting
the roles of head and worker nodes in managing and executing tasks.</li>
<li><strong>Ray Tune and Ray Train</strong>: We introduced Ray Tune for hyperparameter
optimization and Ray Train for scaling model training code, emphasizing their
integration into ML workflows.</li>
<li><strong>Local Ray Cluster Setup</strong>: Step-by-step instructions were provided for
starting a Ray Cluster locally, showcasing how to test the setup with a simple
script.</li>
</ul>
<h3 id="key-takeaways" style="position:relative;">Key Takeaways<a href="#key-takeaways" aria-label="key takeaways permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The key takeaway from Part 1 is the foundation it sets for scalable and
efficient machine learning workflows. By leveraging Ray’s distributed computing
capabilities and DVC’s data version control, we establish a robust framework for
managing complex ML experiments. This combination enhances scalability,
reproducibility, and collaboration in ML projects.</p>
<h3 id="looking-ahead-to-part-2" style="position:relative;">Looking Ahead to Part 2<a href="#looking-ahead-to-part-2" aria-label="looking ahead to part 2 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In Part 2 of the tutorial, we will extend the solution to a Ray Cluster on AWS,
demonstrating how to adapt the setup for cloud-based distributed computing. This
will involve configuring AWS resources, deploying Ray clusters in the cloud, and
running DVC-managed pipelines at scale. The focus will shift towards managing
the increased complexity and leveraging cloud infrastructure to maximize the
efficiency and performance of ML experiments.</p>
<p>Stay tuned for detailed instructions on deploying and managing cloud-based Ray
clusters with DVC as we take the scalability and efficiency of ML workflows to
the next level.</p>
<blockquote>
<p>💡 Did you find this tutorial interesting? Please leave your comments and
share your experience with DVC and Ray! Join us on
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a> 🙌</p>
</blockquote>
<h2 id="references" style="position:relative;">References<a href="#references" aria-label="references permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ul>
<li><a href="https://dvc.org/doc/studio/user-guide/experiments/explore-ml-experiments" target="_blank" rel="nofollow noopener noreferrer">DVC Studio: Explore ML Experiments</a></li>
<li><a href="https://docs.ray.io/en/latest/ray-overview/getting-started.html" target="_blank" rel="nofollow noopener noreferrer">Ray docs: Getting Started</a></li>
<li><a href="https://www.anyscale.com/blog/ray-common-production-challenges-for-generative-ai-infrastructure" target="_blank" rel="nofollow noopener noreferrer">How Ray solves common production challenges for Generative AI infrastructure</a></li>
<li><a href="https://medium.com/samsara-engineering/building-a-modern-machine-learning-platform-with-ray-eb0271f9cbcf" target="_blank" rel="nofollow noopener noreferrer">Building a Modern Machine Learning Platform with Ray</a></li>
</ul>
<p>https://dvc.org/blog/dvc-slurm-cluster-exscientia (Mon, 11 Mar 2024)</p>
<h2 id="introduction" style="position:relative;">Introduction<a href="#introduction" aria-label="introduction permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>For many ML projects, there comes a point when local development hits a wall
and we need to scale up the underlying compute resources. Maybe the dataset
grows too large for your primary workstation, or the deep learning model requires
several high-end GPUs. This should be a routine transition for ML developers,
one they shouldn’t have to give much thought to. In this blog post,
we’ll explain our approach to remote DVC experiments on a SLURM cluster and
share some code to get you started.</p>
<p>We work at an AI-driven precision medicine company called
<a href="https://www.exscientia.ai/" target="_blank" rel="nofollow noopener noreferrer">Exscientia</a>. Our goal is to change the way the
world discovers and develops new medicines. The company is roughly evenly split
between biologists and chemists on one side and technologists on the other, with
your two authors belonging to the latter group; Dom is an AI research scientist
and Luis is an engineer. This context is important to understand why we
gravitated towards DVC in the first place, and why we scaled it up the way we
did.</p>
<h2 id="why-dvc-on-slurm" style="position:relative;">Why DVC on SLURM?<a href="#why-dvc-on-slurm" aria-label="why dvc on slurm permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>As demonstrated in
<a href="https://en.wikipedia.org/wiki/Accelerate_(book)" target="_blank" rel="nofollow noopener noreferrer">research undertaken by the DevOps movement</a>,
it’s hard to maintain consistent software delivery without well-designed tooling
(like CI/CD) and a conducive developer culture (like PRs or working in small
batches). Our domain is highly specific, but the same principles apply: to move
fast while maintaining high quality, reliability and reproducibility, we need to
adopt DevOps best practices. There are only so many hours in a day, and you want
to spend all of them on trying out new ideas and ideally none on setting up
infrastructure. Good tooling optimises scientists’ efficiency and lets them run
more experiments, each more thorough and exhaustive than would otherwise have
been possible, all while maintaining control over research code bases
which can, if left unchecked, turn into precarious Jenga towers. Predictable
code with clear standards also eases collaboration, the lifeblood of science.
Consequently, it’s much more important to agree on a standard, even an arbitrary
one, than to obsess over any particular detail.</p>
<p>At Exscientia we provide researchers with project templates that automatically
set up version control and CI/CD as well as QA tooling like Black, Ruff and
Mypy. To coherently extend this setup to the joint realms of data science and
ML, we integrated DVC. Our scientists can set up a fresh DVC-enabled repository
with all the productivity tooling in just a few keystrokes and start
experimenting right away. And because DVC transparently extends Git, there is
less tool-induced context switching: users are always dealing with Git in some
shape or form, rather than Git (for the code) and a database hidden behind a web
service (for all the rest of it). Less context switching translates to less
frustration and more flow.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 681px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2bafb54e30355ed5d332a518a1e417b1/39600/high-quality-reliability-reproducibility.png" alt="High quality, reliability, and reproducibility" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>To maintain a frictionless developer experience even as model sizes grow beyond
the means of the humble laptop, we surveyed the organisation’s entire
computational estate and the demands placed on it. Our platforms must support a
number of teams with on-demand Jupyter
or RStudio instances as well as workflow orchestration engines. We need to run
large unsupervised jobs, interactive analyses and development sessions across
many domains and technologies: data processing, ML model training and chemical
simulations, each with different resource requirements. Finally, submitting a
large workload should be a smooth and routine experience.</p>
<p>In the end, a cloud-deployed SLURM cluster fit the bill. It can efficiently
scale compute resources while maintaining a user-friendly interface for job
submission. As a bonus, many of our users are already familiar with SLURM from
their past lives in academia. The principal mode of interaction is very simple:
the user submits a Bash script describing exactly what they want to happen,
including the exact resources required. SLURM will wait until such resources are
available and then execute the job as instructed. Thanks to this highly general
interface, the same computational resource, and its administrators, can support
very diverse groups of users at the same time, reducing infrastructural
complexity across the organisation.</p>
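<p>To make the submission model concrete, here is a minimal sketch of how such a
batch script could be assembled from a resource request. The helper
<code>build_sbatch_script</code> and the command <code>python train.py</code> are
hypothetical and exist only for illustration; the <code>#SBATCH</code> options are
standard <code>sbatch</code> options.</p>

```python
def build_sbatch_script(job_name, command, cpus_per_task=1, nodes=1, constraint=None):
    """Render a minimal SLURM batch script as a string (illustrative helper)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
    ]
    # Only request a node feature (e.g. an instance type) when one is given.
    if constraint is not None:
        lines.append(f"#SBATCH --constraint={constraint}")
    # Fail fast inside the job, then run the user's command.
    lines += ["", "set -euo pipefail", command]
    return "\n".join(lines) + "\n"


script = build_sbatch_script(
    "demo-job", "python train.py", cpus_per_task=2, constraint="g5.xlarge"
)
print(script)
```

<p>Saved to a file, a script like this would be handed to the scheduler with
<code>sbatch</code>, which queues it until the requested resources are free.</p>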
<h2 id="a-sample-project" style="position:relative;">A sample project<a href="#a-sample-project" aria-label="a sample project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We’ll set up a <a href="https://github.com/Exscientia/rdvc-demo-project" target="_blank" rel="nofollow noopener noreferrer">basic project</a>
for this demo and, to keep with the drug discovery theme, we will be predicting
solubility of chemical compounds in water using only our recently open-sourced
framework MolFlux.</p>
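<p>The DVC pipeline consists of two stages: a featurisation stage, which loads the
“ESOL” dataset of molecules paired with their aqueous solubilities (how easily
each molecule dissolves in water) and converts the molecules into features, and a
training stage, which fits a model to those features.</p>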
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bd47a8d8d16070d3da90c7c38a329d45/39600/stages.png" alt="Stages" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>A few words about molecules and neural networks. Cheminformatics typically
represents molecules as graphs, with atoms acting as the nodes and chemical
bonds as the edges. There are several ways to feed molecular data to neural
networks, each with its own pros and cons. GNNs can act directly on the
molecular graph. You can also represent the graph as a string (most commonly
using the SMILES format) and feed it to any sequence model such as a
transformer.</p>
<p>In this example we’ll use a classic cheminformatics transformation called ECFP,
or
<a href="https://pubs.acs.org/doi/10.1021/ci100050t" target="_blank" rel="nofollow noopener noreferrer">extended connectivity fingerprint</a>.
It’s essentially analogous to n-grams in NLP, which track whether a particular
sequence of tokens appears in a text document. For example, does the 3-letter
sequence “wea” appear in the Wikipedia article on blazers? Indeed it does, as
part of “wear”.</p>
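<p>As a quick illustration of the n-gram idea, the sketch below collects every
3-letter sequence in a short string and checks for membership. The function name
<code>char_ngrams</code> and the example sentence are made up for this
illustration.</p>

```python
def char_ngrams(text: str, n: int) -> set[str]:
    """Return the set of all length-n character sequences in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


sentence = "people wear blazers"
trigrams = char_ngrams(sentence, 3)

print("wea" in trigrams)  # True, as part of "wear"
print("xyz" in trigrams)  # False
```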
<p>Returning to ECFPs defined on molecular graphs, each “n-gram” is an atom and its
immediate (e.g. 2-hop) neighbourhood. Since the “vocabulary” of all possible
“n-grams” is finite, we can associate to each molecule a finite bit-vector (of
the same length as the vocabulary) such that the choice of 0 or 1 indicates
whether the corresponding “n-gram” is present in the molecule. This bit-vector
is the ECFP fingerprint. And since it has a constant length, we can feed it into
a large variety of ML algorithms, such as the MLP in the training stage.</p>
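<p>The sketch below mimics this construction on plain text rather than on a
molecular graph, so it is an analogy and not a real ECFP implementation. It
takes character n-grams of a SMILES string and, as an assumption made to keep the
vector short, hashes each one to a position in a fixed-length bit-vector instead
of enumerating the whole vocabulary. The name <code>toy_fingerprint</code> and the
length of 64 bits are illustrative choices.</p>

```python
import zlib


def toy_fingerprint(smiles: str, n: int = 3, n_bits: int = 64) -> list[int]:
    """Map a string to a fixed-length bit-vector of hashed character n-grams.

    This is a text analogy for ECFP: real fingerprints hash atom
    neighbourhoods of the molecular graph, not substrings.
    """
    bits = [0] * n_bits
    for i in range(len(smiles) - n + 1):
        fragment = smiles[i:i + n]
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(fragment.encode("utf-8")) % n_bits
        bits[index] = 1
    return bits


ethanol = toy_fingerprint("CCO")
benzene = toy_fingerprint("c1ccccc1")

print(len(ethanol), len(benzene))  # 64 64: same length regardless of input size
print(sum(ethanol))                # 1: "CCO" contains a single 3-gram
```

<p>Because every input maps to a vector of the same length, the output can be
stacked into a feature matrix and passed to a standard model.</p>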
<p>We use DVC to configure and run the pipeline, decoupling the data featurisation
step (where we convert molecules to ECFPs) from the model training step.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d522f9d2e40d12270de0689ed2a6a0ae/39600/stages-dvcyaml.png" alt="DVC Stage Spec" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>DVC pipelines are useful for organising projects. Because they are versioned in
Git, you can reproduce complete workflows and results. Running a new experiment
is a command away:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span></span></code></pre></div>
<p>This executes and tracks experiments in your repository without polluting it
with unnecessary Git commits, branches, directories, etc. For more information
and examples, see the
<a href="https://dvc.org/doc/command-reference/exp/run" target="_blank" rel="nofollow noopener noreferrer">DVC documentation</a>.</p>
<p>It may not be immediately obvious, but our setup is highly modular. Head over to
<code>src/rdvc_demo_project/config/main.yaml</code> to see some of the configuration
options we can tweak for each individual experiment. To start a much longer
training run, execute</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">model.config.trainer.max_epochs</span><span class="token operator">=</span><span class="token number">100</span></span></code></pre></div>
<p>MolFlux was built to be explicitly config-driven and DVC’s
<a href="https://dvc.org/doc/user-guide/experiment-management/hydra-composition" target="_blank" rel="nofollow noopener noreferrer">Hydra integration</a>
exposes all of that flexibility out of the box.</p>
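<p>Conceptually, an override such as
<code>model.config.trainer.max_epochs=100</code> addresses one leaf of a nested
configuration by its dotted path. The sketch below illustrates that idea on a
plain dictionary; it is not how Hydra or DVC implement overrides, and the starting
value of 10 epochs is a made-up default.</p>

```python
import copy


def apply_override(config: dict, override: str) -> dict:
    """Return a copy of `config` with one dotted `key.path=value` override applied."""
    key_path, _, raw_value = override.partition("=")
    # Keep the sketch simple: integers are converted, everything else stays a string.
    value = int(raw_value) if raw_value.lstrip("-").isdigit() else raw_value

    updated = copy.deepcopy(config)
    node = updated
    *parents, leaf = key_path.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return updated


config = {"model": {"config": {"trainer": {"max_epochs": 10}}}}
new_config = apply_override(config, "model.config.trainer.max_epochs=100")

print(new_config["model"]["config"]["trainer"]["max_epochs"])  # 100
print(config["model"]["config"]["trainer"]["max_epochs"])      # 10, original untouched
```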
<h2 id="in-the-cloud" style="position:relative;">In the cloud<a href="#in-the-cloud" aria-label="in the cloud permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Now that DVC experiments run on our local machine, we’d like to move them to the
SLURM cluster. In a second repository, we share the source code of an
internal tool we call <a href="https://github.com/exs-dmiketa/rdvc" target="_blank" rel="nofollow noopener noreferrer">rDVC</a> (for <em>remote</em>
DVC). It is, by design, a very thin layer around <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> and accepts all
of its options and arguments. On top of that, it also recognises many
<a href="https://slurm.schedmd.com/sbatch.html" target="_blank" rel="nofollow noopener noreferrer"><code>sbatch</code> arguments and flags</a>, allowing
it to control which computational resource inside the cluster will be used and
for how long. For a full list of options, consult <code>rdvc run --help</code>.</p>
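<p>One way a thin wrapper like this could separate scheduler options from
experiment options is sketched below. It is a hypothetical illustration, not
rDVC’s actual code: the function <code>split_arguments</code> and the choice of
which <code>sbatch</code> options to recognise are assumptions. Anything the
parser does not recognise is passed through untouched.</p>

```python
import argparse


def split_arguments(argv: list[str]) -> tuple[list[str], list[str]]:
    """Split a command line into sbatch options and pass-through arguments."""
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    # A small, illustrative subset of sbatch options.
    parser.add_argument("--constraint")
    parser.add_argument("--cpus-per-task")
    parser.add_argument("--time")

    # Unrecognised arguments are returned as-is instead of raising an error.
    known, passthrough = parser.parse_known_args(argv)
    sbatch_args = [
        f"--{name.replace('_', '-')}={value}"
        for name, value in vars(known).items()
        if value is not None
    ]
    return sbatch_args, passthrough


sbatch_args, dvc_args = split_arguments(
    ["--constraint", "g5.xlarge", "-S", "fabric=gpu"]
)
print(sbatch_args)  # ['--constraint=g5.xlarge']
print(dvc_args)     # ['-S', 'fabric=gpu']
```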
<p>Let’s demonstrate how it works.</p>
<p>On its own, DVC knows nothing about your remote cluster, so we’ll need to start
with a small amount of setup. Make sure you have cloned the sample project repo
and installed the Python virtual environment using <code>init_python_venv.sh</code>.
Then initialise your local rDVC config with</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">rdvc</span> init project</span></code></pre></div>
<p>Follow the wizard to set up default options for this project’s remote runs; they
are stored in <code>.rdvc/config.toml</code> inside the project repository. Depending
on the cluster’s setup, you may be able to choose the <em>instance type</em> allocated
to your job. For the demo we have configured the cluster with t3.xlarge,
g5.xlarge and g5.12xlarge, and we pick g5.xlarge as the default instance because
we want access to a GPU. Our internal version of rDVC supports many more
instance types, and we encourage you to fork rDVC, redefine the supported
instance types and make the tool your own.
To point rDVC at your SLURM cluster, we’ll run the global initialisation script
next:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">rdvc</span> init global</span></code></pre></div>
<p>rDVC now knows how to contact SLURM, so let’s finish with configuration of the
remote server:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">rdvc</span> init remote</span></code></pre></div>
<p>Nothing stands between us and a remote GPU-powered experiment! Since rDVC is in
many ways just a wrapper around <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a>, we can easily set off a run with
modified parameters as</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">rdvc</span> run <span class="token parameter variable">-S</span> <span class="token assign-left variable">fabric</span><span class="token operator">=</span>gpu</span></code></pre></div>
<p>When your run is finished you can pull it to your local repository with</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp pull</span> origin</span></code></pre></div>
<p>and look at the results.</p>
<h2 id="behind-the-scenes" style="position:relative;">Behind the scenes<a href="#behind-the-scenes" aria-label="behind the scenes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>rDVC compiled a SLURM batch (or “sbatch”) script containing these instructions:</p>
<ol>
<li>Clone the project repo</li>
</ol>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token shebang important">#!/bin/bash</span>
<span class="token comment">#SBATCH --output=".rdvc/logs/slurm-%j.out"</span>
<span class="token comment">#SBATCH --job-name=rdvc-run:rdvc-demo-project:main</span>
<span class="token comment">#SBATCH --wckey=rdvc-demo-project</span>
<span class="token comment">#SBATCH --mail-type=END,FAIL</span>
<span class="token comment">#SBATCH --mail-user=<user_email></span>
<span class="token comment">#SBATCH --constraint=t3.xlarge</span>
<span class="token comment">#SBATCH --cpus-per-task=2</span>
<span class="token comment">#SBATCH --nodes=1</span>
<span class="token comment">#SBATCH --exclusive</span>
<span class="token comment"># Ensure bashrc is loaded</span>
<span class="token builtin class-name">source</span> <span class="token string">"<span class="token variable">${<span class="token environment constant">HOME</span>}</span>/.bashrc"</span>
<span class="token comment"># Exit on failure http://redsymbol.net/articles/unofficial-bash-strict-mode/</span>
<span class="token builtin class-name">set</span> <span class="token parameter variable">-euxo</span> pipefail
<span class="token assign-left variable"><span class="token environment constant">IFS</span></span><span class="token operator">=</span><span class="token string">$'<span class="token entity" title="\n">\n</span><span class="token entity" title="\t">\t</span>'</span>
<span class="token builtin class-name">export</span> <span class="token assign-left variable">RDVC_JOB_REPO_NAME</span><span class="token operator">=</span><span class="token string">"rdvc-demo-project"</span>
<span class="token builtin class-name">export</span> <span class="token assign-left variable">RDVC_JOB_REPO_URL</span><span class="token operator">=</span><span class="token string">"git@github.com:<user>/rdvc-demo-project.git"</span>
<span class="token builtin class-name">export</span> <span class="token assign-left variable">RDVC_JOB_REPO_BRANCH</span><span class="token operator">=</span><span class="token string">"main"</span>
<span class="token builtin class-name">export</span> <span class="token assign-left variable">RDVC_JOB_REPO_REV</span><span class="token operator">=</span><span class="token string">"<git_hash>"</span>
<span class="token builtin class-name">export</span> <span class="token assign-left variable">RDVC_DIR</span><span class="token operator">=</span><span class="token string">"<span class="token variable">${RDVC_DIR<span class="token operator">:-</span>${<span class="token environment constant">HOME</span>}</span>/.rdvc}"</span>
<span class="token comment"># Prepare a directory for the current job</span>
<span class="token builtin class-name">export</span> <span class="token assign-left variable">RDVC_JOB_WORKSPACE_DIR</span><span class="token operator">=</span><span class="token string">"/tmp/rdvc-<span class="token variable">${SLURM_JOB_ID}</span>"</span>
<span class="token function">mkdir</span> <span class="token parameter variable">-p</span> <span class="token string">"<span class="token variable">${RDVC_JOB_WORKSPACE_DIR}</span>"</span>
<span class="token comment"># Ensure cleanup after job finishes, regardless of exit status</span>
<span class="token keyword">function</span> <span class="token function-name function">cleanup_job_dir</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">{</span>
<span class="token builtin class-name">echo</span> <span class="token string">"Cleaning up the job directory."</span>
<span class="token function">rm</span> <span class="token parameter variable">-rf</span> <span class="token string">"<span class="token variable">${RDVC_JOB_WORKSPACE_DIR}</span>"</span>
<span class="token punctuation">}</span>
<span class="token builtin class-name">trap</span> cleanup_job_dir EXIT
<span class="token comment"># Create an insulated Git workspace for the current job</span>
<span class="token builtin class-name">echo</span> <span class="token string">"Creating Git workspace."</span>
<span class="token builtin class-name">export</span> <span class="token assign-left variable">RDVC_JOB_REPO_DIR</span><span class="token operator">=</span><span class="token string">"<span class="token variable">${RDVC_JOB_WORKSPACE_DIR}</span>/<span class="token variable">${RDVC_JOB_REPO_NAME}</span>"</span>
<span class="token function">git</span> clone <span class="token parameter variable">--branch</span> <span class="token string">"<span class="token variable">${RDVC_JOB_REPO_BRANCH}</span>"</span> <span class="token string">"<span class="token variable">${RDVC_JOB_REPO_URL}</span>"</span> <span class="token string">"<span class="token variable">${RDVC_JOB_REPO_DIR}</span>"</span>
<span class="token builtin class-name">cd</span> <span class="token string">"<span class="token variable">${RDVC_JOB_REPO_DIR}</span>"</span> <span class="token operator">||</span> <span class="token builtin class-name">exit</span>
<span class="token comment"># Ensure the job runs on the same revision as was submitted (even if the branch has moved on in the meantime)</span>
<span class="token function">git</span> checkout <span class="token string">"<span class="token variable">${RDVC_JOB_REPO_REV}</span>"</span></code></pre></div>
<ol start="2">
<li>Install the Python virtual environment with <code>init_python_venv.sh</code></li>
</ol>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token comment"># Install Python environment</span>
<span class="token builtin class-name">echo</span> <span class="token string">"Install Python environment."</span>
./init_python_venv.sh
<span class="token builtin class-name">echo</span> <span class="token string">"Activate Python environment."</span>
<span class="token builtin class-name">source</span> ./.venv/bin/activate
<span class="token comment"># Setup links for the DVC cache shared among jobs and projects</span>
dvc config <span class="token parameter variable">--local</span> cache.type hardlink,symlink,copy
<span class="token comment"># Push results of experiments even if job fails</span>
<span class="token keyword">function</span> <span class="token function-name function">cleanup_dvc</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">{</span>
<span class="token keyword">if</span> <span class="token punctuation">[</span> <span class="token string">"<span class="token variable">$1</span>"</span> <span class="token operator">!=</span> <span class="token string">"0"</span> <span class="token punctuation">]</span><span class="token punctuation">;</span> <span class="token keyword">then</span>
<span class="token comment"># Push cache of all runs, including failed</span>
<span class="token builtin class-name">echo</span> <span class="token string">"Job failed. Pushing run cache."</span>
dvc push --run-cache
<span class="token keyword">else</span>
<span class="token builtin class-name">echo</span> <span class="token string">"Job successfully finished."</span>
<span class="token keyword">fi</span>
deactivate
cleanup_job_dir
<span class="token punctuation">}</span>
<span class="token builtin class-name">trap</span> <span class="token string">'cleanup_dvc $?'</span> EXIT</code></pre></div>
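<p>The <code>cache.type hardlink,symlink,copy</code> setting in the script above is an
ordered preference list: DVC tries each link type in turn when placing cached
files in the workspace. The sketch below illustrates that fallback idea with the
standard library only. It is not DVC’s implementation, and
<code>link_with_fallback</code> is a hypothetical helper.</p>

```python
import os
import shutil
import tempfile


def link_with_fallback(source, target, link_types=("hardlink", "symlink", "copy")):
    """Create `target` from `source` using the first link type that succeeds."""
    strategies = {
        "hardlink": os.link,
        "symlink": os.symlink,
        "copy": shutil.copyfile,
    }
    for link_type in link_types:
        try:
            strategies[link_type](source, target)
            return link_type
        except OSError:
            # E.g. hardlinks fail across filesystems; fall through to the next type.
            continue
    raise OSError(f"could not link {source} to {target}")


with tempfile.TemporaryDirectory() as workdir:
    cached = os.path.join(workdir, "cache_object")
    with open(cached, "w") as handle:
        handle.write("data")
    used = link_with_fallback(cached, os.path.join(workdir, "workspace_file"))
    print(used)  # the first link type supported by this filesystem
```

<p>Links avoid duplicating large files when many jobs share one cache, while the
final <code>copy</code> entry keeps the job working on filesystems that support
neither kind of link.</p>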
<ol start="3">
<li>Execute <code>dvc exp run -S fabric=gpu</code></li>
</ol>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token builtin class-name">export</span> <span class="token assign-left variable">RDVC_JOB_EXP_RUN_OPTIONS_STRING</span><span class="token operator">=</span><span class="token string">"-S fabric=gpu"</span>
<span class="token builtin class-name">echo</span> <span class="token string">"Executing DVC experiment."</span>
<span class="token builtin class-name">eval</span> <span class="token string">"dvc exp run --pull --allow-missing <span class="token variable">${RDVC_JOB_EXP_RUN_OPTIONS_STRING}</span>"</span></code></pre></div>
<ol start="4">
<li>Push the experiment to the remote</li>
</ol>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token comment"># Push experiment to the remote and update the repository</span>
<span class="token builtin class-name">echo</span> <span class="token string">"Pushing DVC experiment to Git and DVC remotes."</span>
dvc exp push <span class="token variable">$RDVC_JOB_REPO_URL</span></code></pre></div>
<p>This script is submitted to the cluster over SSH. You can view it in
<code>~/.rdvc/submissions</code>.</p>
<p>And that’s it! It’s so simple you could do it manually in an interactive SLURM
session - and that happens to be a good way to debug issues. If your job fails,
first consult its log over at <code>~/.rdvc/logs</code> and then try to reproduce the
submission script from an interactive session.</p>
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We shared two repositories: a simple DVC project and a tool for remote execution
on SLURM clusters. The latter is universal (it knows nothing about the
project!) and easily hackable. We highly recommend forking and customising it to
your team’s needs.</p>
<p>https://dvc.org/blog/automate-data-validation-and-model-monitoring-with-evidently-and-dvc (Fri, 19 Jan 2024)</p>
<p><em>Feel free to clone the repository provided. It's more than a learning tool;
it's a flexible reference architecture that you can adapt to fit your unique use
cases.</em></p>
<h2 id="why-dvc-and-evidently" style="position:relative;">Why DVC and Evidently?<a href="#why-dvc-and-evidently" aria-label="why dvc and evidently permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In the realm of Machine Learning Operations (MLOps), ensuring the robustness and
reliability of models is paramount. Using the right tools can significantly
enhance your MLOps practices.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1158f2e1f91f438b80df0406fe0c1aaf/39600/2-mlops-workflow.png" alt="Typical Machine Learning Operations (MLOps) workflow" title="Typical Machine Learning Operations (MLOps) workflow" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Typical Machine Learning Operations (MLOps) workflow</em></p>
<p><strong><a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a></strong> is an open-source tool that brings agility and
reproducibility to data science projects by treating data and model training
pipelines as software. It connects versioned data sources and code with
pipelines, tracks experiments, and registers models, all based on GitOps principles.</p>
<p><strong><a href="https://github.com/evidentlyai/evidently" target="_blank" rel="nofollow noopener noreferrer">Evidently</a></strong> is an open-source
Python library to evaluate, test, and
<a href="https://www.evidentlyai.com/ml-in-production/model-monitoring" target="_blank" rel="nofollow noopener noreferrer">monitor ML models</a>.
It has 100+ built-in metrics and tests on data quality, data drift, and model
performance and helps interactively visualize them.</p>
<p>When used together, DVC and Evidently offer a comprehensive solution for
training, predicting, and monitoring ML models.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/3300f3ab904f1e6f56d45ce0fc52a3d7/39600/3-dvc-evidently-features.png" alt="Core features of DVC and Evidently for MLOps practices" title="Core features of DVC and Evidently for MLOps practices" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Core features of DVC and Evidently for MLOps practices</em></p>
<blockquote>
<p>💡 <strong>Want to learn more about DVC and Evidently?</strong></p>
<ul>
<li><a href="https://learn.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">Iterative Tools for Data Scientists & Analysts course</a>
with DVC</li>
<li><a href="https://www.evidentlyai.com/ml-observability-course" target="_blank" rel="nofollow noopener noreferrer">Open-source ML observability course</a>
with Evidently</li>
</ul>
</blockquote>
<h2 id="tutorial-scope" style="position:relative;">Tutorial scope<a href="#tutorial-scope" aria-label="tutorial scope permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This tutorial teaches you how to build DVC pipelines for training and monitoring
jobs, parse Evidently reports, and version reference datasets.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/6983ae2db58b2ee1742db44917f93659/39600/4-example-pipelines.png" alt="Pipelines and artifacts of the example project" title="Pipelines and artifacts of the example project" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Pipelines and artifacts of the example project</em></p>
<p>By the end of this tutorial, you will learn how to implement an ML monitoring
architecture using:</p>
<ul>
<li><a href="https://www.evidentlyai.com/" target="_blank" rel="nofollow noopener noreferrer">Evidently</a> to perform data quality, data drift,
and model quality checks.</li>
<li><a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a> to run monitoring jobs and version monitoring
artifacts.</li>
<li><a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a> to save monitoring metrics from Python
scripts and visualize them in VS Code.</li>
</ul>
<p>Using a Python virtual environment, you can run the example on a local machine.</p>
<h3 id="dataset-sales-forecasting" style="position:relative;">Dataset: Sales Forecasting<a href="#dataset-sales-forecasting" aria-label="dataset sales forecasting permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><strong>Dataset.</strong> You will be diving into a
<a href="https://www.kaggle.com/c/bike-sharing-demand/data" target="_blank" rel="nofollow noopener noreferrer">Kaggle dataset</a> focused on
Bike Sharing Demand. The goal is to predict hourly bike rental volumes.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f457eae2892f8cf155a481f3167c5c11/39600/5-tutorial-1-model-analytics-in-production.png" alt="Source: https://www.evidentlyai.com/blog/tutorial-1-model-analytics-in-production" title="Source: https://www.evidentlyai.com/blog/tutorial-1-model-analytics-in-production" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Source:
<a href="https://www.evidentlyai.com/blog/tutorial-1-model-analytics-in-production" target="_blank" rel="nofollow noopener noreferrer">https://www.evidentlyai.com/blog/tutorial-1-model-analytics-in-production</a></em></p>
<p><strong>ML Application.</strong> The model uses historical usage and weather data to predict bike rental
demand, which is essential for operational efficiency and customer service.</p>
<p>Similar applications:</p>
<ul>
<li>Applicable in sectors like retail, transportation, and energy for demand
prediction.</li>
<li>Ensures models stay relevant and effective despite changing data patterns.</li>
</ul>
<h3 id="prerequisites" style="position:relative;">Prerequisites<a href="#prerequisites" aria-label="prerequisites permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We expect that you:</p>
<ul>
<li>Have learned the basics of DVC by following the
<a href="https://dvc.org/doc/start#get-started-with-dvc" target="_blank" rel="nofollow noopener noreferrer">Get Started with DVC</a> guide</li>
<li>Have gone through the
Evidently <a href="https://docs.evidentlyai.com/get-started/tutorial/?utm_source=website&utm_medium=referral&utm_campaign=blog_text&utm_content=batch-ml-monitoring-architecture" target="_blank" rel="nofollow noopener noreferrer">Get Started Tutorial</a> and
can generate visual and JSON Reports with Metrics.</li>
</ul>
<p>To follow this tutorial, you'll need the following tools installed on your local
machine:</p>
<ul>
<li>Python version 3.11 or above</li>
<li>Git</li>
<li>VS Code and
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC Extension for VS Code</a></li>
</ul>
<blockquote>
<p>💡 Note: we tested this example on macOS/Linux.</p>
</blockquote>
<h2 id="-installation" style="position:relative;">👩💻 Installation<a href="#-installation" aria-label=" installation permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>First, install the pre-built example. See the repository's README file for more
technical details and notes.</p>
<p><strong>1. Fork / Clone this repository</strong></p>
<p>Clone the GitHub repository with the example code. This repository provides the
necessary files and scripts for setting up the integration between Evidently and
DVC.</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token function">git</span> clone https://github.com/iterative/evidently-dvc.git
$ <span class="token builtin class-name">cd</span> evidently-dvc</code></pre></div>
<p><strong>2. Install Python dependencies</strong></p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ python3 <span class="token parameter variable">-m</span> venv .venv
$ <span class="token builtin class-name">echo</span> <span class="token string">"export PYTHONPATH=<span class="token environment constant">$PWD</span>"</span> <span class="token operator">>></span> .venv/bin/activate
$ <span class="token builtin class-name">source</span> .venv/bin/activate
$ pip <span class="token function">install</span> <span class="token parameter variable">-r</span> requirements.txt</code></pre></div>
<blockquote>
<p>💡 Note: To ensure everything runs smoothly, please make sure to execute all
the code examples provided below within an activated virtual environment.</p>
</blockquote>
<h2 id="-run-ml-monitoring-example" style="position:relative;">🚀 Run ML monitoring example<a href="#-run-ml-monitoring-example" aria-label=" run ml monitoring example permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Now, let’s launch the pre-built example to run monitoring pipelines and manage
monitoring artifacts using DVC and Evidently.</p>
<h3 id="1-running-the-train-pipeline" style="position:relative;">1. Running the <code>train</code> pipeline<a href="#1-running-the-train-pipeline" aria-label="1 running the train pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>To run the entire pipeline, execute a simple command in your terminal. Make sure
you're in the project's root directory:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ dvc exp run pipelines/train/dvc.yaml</code></pre></div>
<p>This command runs the stages defined in the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file located in
<code>pipelines/train</code>. DVC experiments allow you to track changes made during each
run, making it easier to iterate and improve your model. Here’s what happens in
each stage:</p>
<ul>
<li><strong>load_data</strong>:
<ul>
<li>Downloads and unzips the dataset into your <code>data/</code> directory.</li>
</ul>
</li>
<li><strong>extract_data</strong>:
<ul>
<li>Executes <code>src/stages/extract_data.py</code>, using parameters from
<code>pipelines/train/params.yaml</code>.</li>
<li>Outputs training and testing datasets to specified paths.</li>
</ul>
</li>
<li><strong>train</strong>:
<ul>
<li>Runs <code>train.py</code>, training the model with the training data.</li>
<li>Saves the model to <code>models/model.joblib</code>.</li>
</ul>
</li>
<li><strong>evaluate</strong>:
<ul>
<li>Runs <code>evaluate.py</code> to assess the model on the test data.</li>
<li>Outputs reference data for monitoring to <code>data/reference_data.csv</code>.</li>
<li>Builds the model performance report using Evidently Regression Preset and
saves it to <code>reports/train/model_performance.html</code>.</li>
<li>Saves metrics to <code>reports/train/metrics.json</code>.</li>
</ul>
</li>
</ul>
<p>After the pipeline is complete, you can:</p>
<ul>
<li>(1) visualize training metrics in the
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC Extension for Visual Studio Code</a>,</li>
<li>(2) open the detailed model performance HTML report built with Evidently in
the browser.</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/47bdc4d5ee7cfd053bf17a9debffbdf8/39600/6-metrics-and-reports.png" alt="Metrics and reports for Training pipeline" title="Metrics and reports for Training pipeline" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Metrics and reports for Training pipeline</em></p>
<blockquote>
<p>💡 Note: Make sure you have the
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC Extension for Visual Studio Code</a>
installed.</p>
</blockquote>
<h3 id="2-running-the-predict-pipeline" style="position:relative;">2. Running the <code>predict</code> pipeline<a href="#2-running-the-predict-pipeline" aria-label="2 running the predict pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Once your model is trained and evaluated, the next vital step is to perform
predictions on new data. To run the pipeline, execute the following command in
your terminal:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ dvc repro pipelines/predict/dvc.yaml</code></pre></div>
<p>Here’s what happens in each stage:</p>
<ul>
<li><strong>predict</strong>:
<ul>
<li>Executes <code>src/stages/predict.py</code>, using parameters from
<code>pipelines/predict/params.yaml</code>.</li>
<li>Saves predictions to a CSV file, formatted as
<code>data/predictions/${predict.week_start}--${predict.week_end}.csv</code>.
Parameters <code>week_start</code> and <code>week_end</code> are located in the corresponding
<code>params.yaml</code> file.</li>
</ul>
</li>
</ul>
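<p>As a rough illustration of the naming scheme above, the sketch below builds the
prediction file path from the pipeline parameters. The helper name
<code>prediction_path</code> and the sample dates are hypothetical and are not taken from the
example repository.</p>

```python
from pathlib import PurePosixPath


def prediction_path(predictions_dir: str, week_start: str, week_end: str) -> PurePosixPath:
    """Build '{predictions_dir}/{week_start}--{week_end}.csv'."""
    return PurePosixPath(predictions_dir) / f"{week_start}--{week_end}.csv"


# Hypothetical values standing in for the `predict` section of params.yaml.
params = {
    "predictions_dir": "data/predictions",
    "week_start": "2011-01-29",
    "week_end": "2011-02-04",
}

path = prediction_path(
    params["predictions_dir"], params["week_start"], params["week_end"]
)
print(path)
```

<p>Because the week boundaries are part of the file name, each weekly run writes a
new file instead of overwriting the previous one.</p>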
<p>DVC automatically starts version-controlling the saved CSV file. You can now
push the data to remote storage in the cloud.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/19ac13afb82e9d61ad06fe7726b3573f/39600/7-artifacts-versioned-with-dvc.png" alt="Managing prediction datasets with DVC" title="Managing prediction datasets with DVC" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Managing prediction datasets with DVC</em></p>
<blockquote>
<p>💡 Note: For more data management scenarios and features, see
<a href="https://dvc.org/doc/user-guide/data-management/remote-storage" target="_blank" rel="nofollow noopener noreferrer">Data Management with DVC</a>
in the docs.</p>
</blockquote>
<h3 id="3-run-monitor-pipeline" style="position:relative;">3. Run <code>monitor</code> pipeline<a href="#3-run-monitor-pipeline" aria-label="3 run monitor pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The monitor pipeline consists of two key stages: <code>monitor_data</code> and
<code>monitor_model</code>. Together, they track the ongoing health and performance of your
machine learning models. To run the pipeline, execute the following command in your terminal:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ dvc repro pipelines/monitor/dvc.yaml</code></pre></div>
<p>Here’s what happens in each stage:</p>
<ul>
<li><strong>monitor_data:</strong>
<ul>
<li>This stage is responsible for monitoring data quality and detecting data
drift.</li>
<li>Executes <code>src/stages/monitor_data.py</code> with configuration parameters from
<code>pipelines/monitor/params.yaml</code>.</li>
<li>Produces HTML reports for data drift and data quality, and stores them in a
directory named <code>reports/${predict.week_start}--${predict.week_end}</code>.</li>
</ul>
</li>
<li><strong>monitor_model:</strong>
<ul>
<li>Focuses on monitoring the performance of the model and detecting target
drift.</li>
<li>Executes <code>src/stages/monitor_model.py</code> with configuration parameters from
<code>pipelines/monitor/params.yaml</code>.</li>
<li>Generates HTML reports for model performance and target drift, saved in the
same monitoring reports directory, named
<code>reports/${predict.week_start}--${predict.week_end}</code>.</li>
</ul>
</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/836c14f61a048464283c543c2f53370a/39600/8-evidently-reports.png" alt="Model Performance and Data Validation reports" title="Model Performance and Data Validation reports" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Model Performance and Data Validation reports</em></p>
<h2 id="-data-validation-and-model-monitoring-with-evidently" style="position:relative;">📈 Data Validation and Model Monitoring with Evidently<a href="#-data-validation-and-model-monitoring-with-evidently" aria-label=" data validation and model monitoring with evidently permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Now, let’s explore how Evidently works internally as a part of an ML model
monitoring architecture.</p>
<h3 id="metrics-and-reports" style="position:relative;">Metrics and Reports<a href="#metrics-and-reports" aria-label="metrics and reports permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The idea behind Evidently is very simple: it calculates a bunch of metrics and
organizes them into nice reports. Reports are the most effective way to analyze
and debug your models and data visually. You may save reports as HTML files,
JSON snapshots, or export the metrics externally by parsing JSON or Python
dictionary outputs. This allows you to apply Evidently for multiple validation
and monitoring scenarios in
<a href="https://evidentlyai.com/blog/fastapi-tutorial" target="_blank" rel="nofollow noopener noreferrer">real-time</a> and
<a href="https://www.evidentlyai.com/blog/batch-ml-monitoring-architecture" target="_blank" rel="nofollow noopener noreferrer">batch-scoring</a>
ML applications:</p>
<ul>
<li>save monitoring reports in HTML files and use them to analyze and debug your
models and data,</li>
<li>get values for specific metrics, and log them to external databases (like
PostgreSQL) and dashboarding tools (like Grafana),</li>
<li>save monitoring reports (as snapshots) in JSON files over time and run an
<a href="https://docs.evidentlyai.com/user-guide/monitoring/monitoring_overview" target="_blank" rel="nofollow noopener noreferrer">Evidently Monitoring Dashboard</a>
for continuous monitoring.</li>
</ul>
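<p>To make the "get values for specific metrics" scenario more concrete, here is a
minimal sketch that pulls one value out of a report exported as a Python
dictionary. The dictionary below is hand-written: its layout (a <code>metrics</code> list
whose entries carry <code>metric</code> and <code>result</code> keys) and all numbers are assumptions for
illustration, and the real output may differ between Evidently versions.</p>

```python
from typing import Any

# Hand-written stand-in for a report exported as a Python dictionary.
# The layout and the numbers are assumptions for illustration only.
report_dict: dict[str, Any] = {
    "metrics": [
        {
            "metric": "DatasetDriftMetric",
            "result": {"share_of_drifted_columns": 0.25, "dataset_drift": False},
        },
        {
            "metric": "RegressionQualityMetric",
            "result": {"current": {"mean_abs_error": 12.5}},
        },
    ]
}


def metric_result(report: dict[str, Any], metric_name: str) -> dict[str, Any]:
    """Return the `result` block of the first entry with a matching name."""
    for entry in report["metrics"]:
        if entry["metric"] == metric_name:
            return entry["result"]
    raise KeyError(metric_name)


drift = metric_result(report_dict, "DatasetDriftMetric")
print(drift["share_of_drifted_columns"])
```

<p>Values extracted this way can then be written to a database or passed to a
metrics logger, depending on the monitoring setup.</p>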
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7428bfdbfbd449b82ab6e881b38f4505/39600/9-evidently.png" alt="Source: https://docs.evidentlyai.com/ " title="Source: https://docs.evidentlyai.com/ " loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Source: <a href="https://docs.evidentlyai.com/" target="_blank" rel="nofollow noopener noreferrer">https://docs.evidentlyai.com/</a></em></p>
<p>If you choose to use HTML and JSON files, you need a way to store and version
them. In the following section of the tutorial, we will explore how DVC can
assist with this.</p>
<h3 id="data-requirements" style="position:relative;">Data Requirements<a href="#data-requirements" aria-label="data requirements permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>To calculate metrics and build monitoring reports with Evidently, you typically need <strong>two
datasets</strong>:</p>
<ul>
<li><strong>Reference</strong> dataset is a baseline for comparison or an exemplary dataset
that helps generate test conditions. This can be training data or earlier
production data. (from
<a href="https://docs.evidentlyai.com/user-guide/input-data/data-requirements" target="_blank" rel="nofollow noopener noreferrer">docs</a>)</li>
<li><strong>Current</strong> dataset is the dataset you want to evaluate. It can include the
most recent production data. (from
<a href="https://docs.evidentlyai.com/user-guide/input-data/data-requirements" target="_blank" rel="nofollow noopener noreferrer">docs</a>)</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c980fb2bf7a2d0daf0948a146024cb93/39600/10-evidently-datasets.png" alt="Original image: https://docs.evidentlyai.com/user-guide/input-data/data-requirements " title="Original image: https://docs.evidentlyai.com/user-guide/input-data/data-requirements " loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Original image:
<a href="https://docs.evidentlyai.com/user-guide/input-data/data-requirements" target="_blank" rel="nofollow noopener noreferrer">https://docs.evidentlyai.com/user-guide/input-data/data-requirements</a></em></p>
<p>In this tutorial, the reference dataset is a sample extracted from the training
dataset. Generating it automatically during training keeps the version of the
reference dataset aligned with the version of the model.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token comment"># src/stages/evaluate.py</span>
reference_data <span class="token operator">=</span> train_data<span class="token punctuation">.</span>sample<span class="token punctuation">(</span>frac<span class="token operator">=</span><span class="token number">0.3</span><span class="token punctuation">)</span></code></pre></div>
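<p>One detail worth keeping in mind: a random sample changes from run to run unless
a seed is fixed. The sketch below shows the idea with the Python standard
library only. It is not the code from <code>evaluate.py</code>, which works on a pandas
DataFrame, and the helper name <code>sample_fraction</code> is hypothetical. With pandas,
passing <code>random_state</code> to <code>DataFrame.sample</code> serves the same purpose.</p>

```python
import random


def sample_fraction(rows: list, frac: float, seed: int) -> list:
    """Return a reproducible random sample containing `frac` of the rows."""
    k = round(len(rows) * frac)
    # A dedicated Random instance keeps the global random state untouched.
    return random.Random(seed).sample(rows, k)


rows = list(range(100))  # stand-in for the rows of the training dataset
reference_a = sample_fraction(rows, frac=0.3, seed=42)
reference_b = sample_fraction(rows, frac=0.3, seed=42)

# Same seed, same rows: both calls select an identical reference sample.
print(len(reference_a), reference_a == reference_b)
```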
<h2 id="-automate-data-and-monitoring-pipelines-with-dvc" style="position:relative;">📈 Automate Data and Monitoring Pipelines with DVC<a href="#-automate-data-and-monitoring-pipelines-with-dvc" aria-label=" automate data and monitoring pipelines with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This section will guide you through the design and implementation of monitoring
pipelines, providing insights for further improvement and customization.</p>
<h3 id="separate-dvc-pipelines" style="position:relative;">Separate DVC pipelines<a href="#separate-dvc-pipelines" aria-label="separate dvc pipelines permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In the tutorial example, we aimed to follow these ML system design
principles:</p>
<ul>
<li><strong>Modular Design</strong>: Each stage of the ML workflow, such as data preparation,
model training, and monitoring, is encapsulated in separate DVC pipelines.
This modular approach enhances maintainability and scalability.</li>
<li><strong>Pipeline Independence</strong>: These pipelines can be run independently, which
allows for flexibility in execution and troubleshooting. In a typical
scenario, training, inference, and monitoring pipelines run independently at
different time intervals and environments.</li>
<li><strong>Reusability</strong>: By separating the pipelines, you can easily reuse components
across different projects or stages of the same project.</li>
</ul>
<p>As a result, the tutorial example has three pipelines: training, prediction
(inference), and monitoring. DVC allows you to have multiple <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> files to
configure and run pipelines.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7ae9f4cbd0b609a6d4c7d29cd0eb12cf/39600/10-pipelines-dir.png" alt="Pipelines Directory Structure" title="Pipelines Directory Structure" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Pipelines Directory Structure</em></p>
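<p>Since each pipeline has its own <code>dvc.yaml</code>, a scheduler or a small script can
trigger them independently. The following is a minimal sketch under that
assumption. The helper <code>repro_command</code> is hypothetical, and for simplicity it uses
<code>dvc repro</code> for all three pipelines, while this tutorial runs the training
pipeline with <code>dvc exp run</code>.</p>

```python
import shlex

# Pipeline names match the directories under `pipelines/` described above.
PIPELINES = ("train", "predict", "monitor")


def repro_command(pipeline: str) -> list[str]:
    """Build the `dvc repro` command for one pipeline's dvc.yaml."""
    if pipeline not in PIPELINES:
        raise ValueError(f"unknown pipeline: {pipeline}")
    return ["dvc", "repro", f"pipelines/{pipeline}/dvc.yaml"]


for name in PIPELINES:
    # Printing instead of executing keeps the sketch free of side effects;
    # pass the list to subprocess.run(..., check=True) to actually run it.
    print(shlex.join(repro_command(name)))
```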
<p>Let’s explore an excerpt from the <code>pipelines/monitor/dvc.yaml</code> to discuss a few
“advanced” configuration features you may find useful:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">vars</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token key atrule">PIPELINE_DIR</span><span class="token punctuation">:</span> pipelines/monitor
<span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">monitor_data</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python src/stages/monitor_data.py <span class="token punctuation">-</span><span class="token punctuation">-</span>config=$<span class="token punctuation">{</span>PIPELINE_DIR<span class="token punctuation">}</span>/params.yaml
    <span class="token key atrule">wdir</span><span class="token punctuation">:</span> ../..
    <span class="token key atrule">params</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> $<span class="token punctuation">{</span>PIPELINE_DIR<span class="token punctuation">}</span>/params.yaml<span class="token punctuation">:</span>
          <span class="token punctuation">-</span> predict
          <span class="token punctuation">-</span> monitoring
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> src/stages/monitor_data.py
      <span class="token punctuation">-</span> $<span class="token punctuation">{</span>predict.predictions_dir<span class="token punctuation">}</span>/$<span class="token punctuation">{</span>predict.week_start<span class="token punctuation">}</span><span class="token punctuation">-</span><span class="token punctuation">-</span>$<span class="token punctuation">{</span>predict.week_end<span class="token punctuation">}</span>.csv
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> $<span class="token punctuation">{</span>monitoring.reports_dir<span class="token punctuation">}</span>/$<span class="token punctuation">{</span>predict.week_start<span class="token punctuation">}</span><span class="token punctuation">-</span><span class="token punctuation">-</span>$<span class="token punctuation">{</span>predict.week_end<span class="token punctuation">}</span>/$<span class="token punctuation">{</span>monitoring.data_drift_path<span class="token punctuation">}</span>
      <span class="token punctuation">-</span> $<span class="token punctuation">{</span>monitoring.reports_dir<span class="token punctuation">}</span>/$<span class="token punctuation">{</span>predict.week_start<span class="token punctuation">}</span><span class="token punctuation">-</span><span class="token punctuation">-</span>$<span class="token punctuation">{</span>predict.week_end<span class="token punctuation">}</span>/$<span class="token punctuation">{</span>monitoring.data_quality_path<span class="token punctuation">}</span></code></pre></div>
<ul>
<li>☝️ <strong>Using <code>vars</code>:</strong>
<ul>
<li>Variables (<code>vars</code>) in DVC define values that can be reused across the
<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. They make complex <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> files more readable and easier
to update.</li>
<li>In this example, <code>PIPELINE_DIR</code> is used to specify the pipeline directory in
the project repository. You may reference this variable using the
<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#templating" target="_blank" rel="nofollow noopener noreferrer">templating</a>
format to insert values like <code>${PIPELINE_DIR}</code>.</li>
</ul>
</li>
<li>☝️ <strong>Using <code>wdir</code>:</strong>
<ul>
<li>The <code>wdir</code> (working directory) key in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> sets the directory context
for running the commands defined in a stage. It allows you to use relative
paths for dependencies (<code>deps</code>), outputs (<code>outs</code>), and scripts within that
directory.</li>
<li>In this example, <code>wdir: ../..</code> points to the repository root, so paths in
<code>deps</code> and <code>outs</code> are easier to read and maintain.</li>
</ul>
</li>
<li>☝️ <strong>Using separate <code>params.yaml</code>:</strong>
<ul>
<li>The <code>params.yaml</code> file holds parameters, and DVC allows a project to have
more than one.</li>
<li>This example has a separate <code>params.yaml</code> file for each pipeline. To tell DVC
which file to use, we specify the full path to the <code>params.yaml</code>
using the <code>PIPELINE_DIR</code> variable.</li>
</ul>
</li>
</ul>
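<p>To make the templating idea concrete, here is a minimal sketch of how a <code>${section.key}</code> placeholder can be resolved against nested parameters. It is not DVC's implementation: the helper names <code>lookup</code> and <code>render</code> and the <code>PARAMS</code> dictionary are assumptions made for illustration only.</p>

```python
import re
from typing import Any, Dict

# Hypothetical parameter values, mirroring the nested keys referenced in dvc.yaml.
PARAMS: Dict[str, Any] = {
    "predict": {"week_start": "2011-01-29", "week_end": "2011-02-04"},
    "monitoring": {"reports_dir": "reports", "data_quality_path": "data_quality.html"},
}

# Matches ${some.dotted.key}
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def lookup(params: Dict[str, Any], dotted_key: str) -> Any:
    """Walk a nested dict following a dotted key such as 'predict.week_start'."""
    value: Any = params
    for part in dotted_key.split("."):
        value = value[part]
    return value


def render(template: str, params: Dict[str, Any]) -> str:
    """Replace every ${dotted.key} placeholder with its value from params."""
    return _PLACEHOLDER.sub(lambda m: str(lookup(params, m.group(1).strip())), template)


path = render(
    "${monitoring.reports_dir}/${predict.week_start}--${predict.week_end}/"
    "${monitoring.data_quality_path}",
    PARAMS,
)
print(path)  # reports/2011-01-29--2011-02-04/data_quality.html
```

<p>Changing a single value in the parameters changes every path that references it, which is the main reason to keep such values out of the stage definitions.</p>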
<h3 id="storing-monitoring-configuration-in-paramsyaml" style="position:relative;">Storing monitoring configuration in <code>params.yaml</code><a href="#storing-monitoring-configuration-in-paramsyaml" aria-label="storing monitoring configuration in paramsyaml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In some monitoring scenarios, you may have parameterized pipelines. With DVC,
you can reuse the <code>params.yaml</code> file to configure the monitoring
pipeline. This brings a few benefits:</p>
<ul>
<li><strong>Ease of Modification</strong>: You can quickly adjust the pipeline's behavior by
modifying the parameters in this file, such as changing the data source or
tuning model parameters.</li>
<li><strong>Version Control for Parameters</strong>: Since <code>params.yaml</code> is under Git version
control, changes in configurations are tracked by Git, ensuring
reproducibility and transparency in your pipeline's evolution.</li>
</ul>
<p>Let’s explore <code>pipelines/monitor/params.yaml</code>:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token punctuation">---</span>
<span class="token key atrule">data</span><span class="token punctuation">:</span>
  <span class="token key atrule">predict_data</span><span class="token punctuation">:</span> data/test.csv
  <span class="token key atrule">target_col</span><span class="token punctuation">:</span> cnt
  <span class="token key atrule">prediction_col</span><span class="token punctuation">:</span> prediction
  <span class="token key atrule">numerical_features</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token string">'temp'</span><span class="token punctuation">,</span> <span class="token string">'atemp'</span><span class="token punctuation">,</span> <span class="token string">'hum'</span><span class="token punctuation">,</span> <span class="token string">'windspeed'</span><span class="token punctuation">,</span> <span class="token string">'hr'</span><span class="token punctuation">,</span> <span class="token string">'weekday'</span><span class="token punctuation">]</span>
  <span class="token key atrule">categorical_features</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token string">'season'</span><span class="token punctuation">,</span> <span class="token string">'holiday'</span><span class="token punctuation">,</span> <span class="token string">'workingday'</span><span class="token punctuation">]</span>
<span class="token key atrule">predict</span><span class="token punctuation">:</span>
  <span class="token key atrule">model_path</span><span class="token punctuation">:</span> models/model.joblib
  <span class="token key atrule">week_start</span><span class="token punctuation">:</span> <span class="token string">'2011-01-29'</span>
  <span class="token key atrule">week_end</span><span class="token punctuation">:</span> <span class="token string">'2011-02-04'</span>
  <span class="token key atrule">predictions_dir</span><span class="token punctuation">:</span> data/predictions
<span class="token key atrule">monitoring</span><span class="token punctuation">:</span>
  <span class="token key atrule">reports_dir</span><span class="token punctuation">:</span> reports
  <span class="token key atrule">reference_data</span><span class="token punctuation">:</span> data/reference_data.csv
  <span class="token comment"># for monitor_model</span>
  <span class="token key atrule">model_performance_path</span><span class="token punctuation">:</span> model_performance.html
  <span class="token key atrule">target_drift_path</span><span class="token punctuation">:</span> target_drift.html
  <span class="token comment"># for monitor_data</span>
  <span class="token key atrule">data_drift_path</span><span class="token punctuation">:</span> data_drift.html
  <span class="token key atrule">data_quality_path</span><span class="token punctuation">:</span> data_quality.html</code></pre></div>
<ul>
<li>☝️ <strong>List features to be included in monitoring reports:</strong>
<ul>
<li><code>target_col</code> and <code>prediction_col</code> define the names of the target and
prediction columns,</li>
<li><code>numerical_features</code> and <code>categorical_features</code> define feature names for
monitoring purposes. This could be especially beneficial for data monitoring
and data drift reports.</li>
</ul>
</li>
<li>☝️ <strong>Parametrized data samples:</strong>
<ul>
<li><code>week_start</code> and <code>week_end</code> define the time frame for which predictions are
generated. This example can be modified to support other approaches for data
extraction.</li>
</ul>
</li>
<li>☝️ <strong>Specify a reference dataset:</strong>
<ul>
<li><code>reference_data</code> specifies a path to the reference dataset used in
monitoring.</li>
<li>You may have multiple reference datasets and select among them to generate
reports.</li>
</ul>
</li>
<li>☝️ <strong>Specify the location to store monitoring artifacts:</strong>
<ul>
<li>The <code>monitoring</code> section also specifies the location for monitoring reports.</li>
<li>You may update the reports directory or filenames in a single place. It’s
handy!</li>
</ul>
</li>
</ul>
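<p>As one way to picture the parametrized data window, the sketch below selects rows that fall between <code>week_start</code> and <code>week_end</code> and computes the following week's window. The function names and the <code>date</code> field of each row are hypothetical; the project's own extraction code may look different.</p>

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple


def next_window(week_start: str, week_end: str) -> Tuple[str, str]:
    """Shift an ISO-formatted [week_start, week_end] window forward by 7 days."""
    start = date.fromisoformat(week_start) + timedelta(days=7)
    end = date.fromisoformat(week_end) + timedelta(days=7)
    return start.isoformat(), end.isoformat()


def select_rows(
    rows: List[Dict[str, str]], week_start: str, week_end: str
) -> List[Dict[str, str]]:
    """Keep rows whose (hypothetical) 'date' field lies inside the window, inclusive."""
    lo, hi = date.fromisoformat(week_start), date.fromisoformat(week_end)
    return [r for r in rows if lo <= date.fromisoformat(r["date"]) <= hi]


# Made-up rows for illustration only.
rows = [
    {"date": "2011-01-28", "cnt": "10"},
    {"date": "2011-01-29", "cnt": "12"},
    {"date": "2011-02-04", "cnt": "15"},
    {"date": "2011-02-05", "cnt": "9"},
]
selected = select_rows(rows, "2011-01-29", "2011-02-04")
following = next_window("2011-01-29", "2011-02-04")
print(len(selected), following)
```

<p>Because the window lives in <code>params.yaml</code>, moving to the next week is a two-value change that Git records like any other edit.</p>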
<h3 id="log-monitoring-metrics-with-dvclive-and-visualize-them-in-vs-code-ide" style="position:relative;">Log monitoring metrics with DVCLive and visualize them in VS Code IDE<a href="#log-monitoring-metrics-with-dvclive-and-visualize-them-in-vs-code-ide" aria-label="log monitoring metrics with dvclive and visualize them in vs code ide permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a> provides a Python API to log metrics,
plots, models, and other artifacts from code. Metrics and plots saved with
DVCLive can be automatically visualized in the DVC extension for VS Code.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f2ad21f7263dfbc0cf88d0bc7c9a2b90/39600/11-metrics-vscode.png" alt="Metrics in DVC Extension for VS Code" title="Metrics in DVC Extension for VS Code" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Metrics in DVC Extension for VS Code</em></p>
<p>Let’s explore an example of the <code>src/stages/evaluate.py</code> script to demonstrate
how DVCLive can help in DVC projects.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive <span class="token keyword">import</span> Live
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
<span class="token comment"># Build a report</span>
model_performance_report <span class="token operator">=</span> Report<span class="token punctuation">(</span>metrics<span class="token operator">=</span><span class="token punctuation">[</span>RegressionPreset<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
model_performance_report<span class="token punctuation">.</span>run<span class="token punctuation">(</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">)</span>
<span class="token comment"># Extract metrics</span>
regression_metrics<span class="token punctuation">:</span> Dict <span class="token operator">=</span> model_performance_report<span class="token punctuation">.</span>as_dict<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token string">'metrics'</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'result'</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"current"</span><span class="token punctuation">]</span>
metric_names <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token string">'r2_score'</span><span class="token punctuation">,</span> <span class="token string">'rmse'</span><span class="token punctuation">,</span> <span class="token string">'mean_error'</span><span class="token punctuation">,</span> <span class="token string">'mean_abs_error'</span><span class="token punctuation">,</span> <span class="token string">'mean_abs_perc_error'</span><span class="token punctuation">]</span>
selected_metrics <span class="token operator">=</span> <span class="token punctuation">{</span>k<span class="token punctuation">:</span> regression_metrics<span class="token punctuation">.</span>get<span class="token punctuation">(</span>k<span class="token punctuation">)</span> <span class="token keyword">for</span> k <span class="token keyword">in</span> metric_names<span class="token punctuation">}</span>
<span class="token comment"># Save evaluation metrics with DVCLive</span>
<span class="token keyword">with</span> Live<span class="token punctuation">(</span><span class="token builtin">dir</span><span class="token operator">=</span><span class="token builtin">str</span><span class="token punctuation">(</span>REPORTS_DIR<span class="token punctuation">)</span><span class="token punctuation">,</span>
          dvcyaml<span class="token operator">=</span><span class="token string-interpolation"><span class="token string">f"</span><span class="token interpolation"><span class="token punctuation">{</span>pdir<span class="token punctuation">}</span></span><span class="token string">/dvc.yaml"</span></span><span class="token punctuation">,</span><span class="token punctuation">)</span> <span class="token keyword">as</span> live<span class="token punctuation">:</span>
    <span class="token keyword">for</span> k<span class="token punctuation">,</span> v <span class="token keyword">in</span> selected_metrics<span class="token punctuation">.</span>items<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        live<span class="token punctuation">.</span>log_metric<span class="token punctuation">(</span>k<span class="token punctuation">,</span> v<span class="token punctuation">,</span> plot<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span></code></pre></div>
<p>This code snippet demonstrates how to log machine learning model performance
metrics calculated with Evidently using DVCLive. Here's a breakdown of what it
does:</p>
<ol>
<li><code>model_performance_report</code> is created using Regression Preset from Evidently.</li>
<li>The <code>model_performance_report</code> is executed with <code>.run(...)</code>, where the actual
model evaluation and metric computation occur.</li>
<li>After the report has run, you can extract the
required metrics. In this example, <code>selected_metrics</code> contains
<code>['r2_score', 'rmse', 'mean_error', 'mean_abs_error', 'mean_abs_perc_error']</code>.</li>
<li>Inside the <code>Live</code> context, each entry of <code>selected_metrics</code> is logged with the <code>live.log_metric()</code>
method. There are a few important arguments:
<ol>
<li><code>dir=str(REPORTS_DIR)</code> instructs DVCLive to save metrics to the
<code>reports/train</code> directory.</li>
<li><code>dvcyaml=f"{pdir}/dvc.yaml"</code> instructs DVCLive to use the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> of the
<code>train</code> stage to add information about metrics files. The full path is
<code>pipelines/train/dvc.yaml</code>.</li>
</ol>
</li>
</ol>
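<p>The metric selection step can be tried without Evidently or DVCLive installed. The sketch below uses a hand-written dictionary with the same nesting the script indexes into (<code>['metrics'][0]['result']['current']</code>); the numbers are made up, and <code>select_metrics</code> is a hypothetical helper. Unlike the <code>.get(k)</code> call in the script, which yields <code>None</code> for a missing key, this variant skips metrics that are absent.</p>

```python
from typing import Any, Dict, List

# Hypothetical report dictionary; values are invented for illustration.
report_dict: Dict[str, Any] = {
    "metrics": [
        {"result": {"current": {"r2_score": 0.91, "rmse": 12.5, "mean_error": -0.4}}}
    ]
}

metric_names: List[str] = ["r2_score", "rmse", "mean_error", "mean_abs_error"]


def select_metrics(report: Dict[str, Any], names: List[str]) -> Dict[str, float]:
    """Pick the named metrics from the first metric result, skipping absent ones."""
    current = report["metrics"][0]["result"]["current"]
    return {name: current[name] for name in names if current.get(name) is not None}


selected = select_metrics(report_dict, metric_names)
print(selected)
```

<p>Filtering before logging keeps the logged metrics limited to values the report actually produced.</p>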
<blockquote>
<p>💡 Note: If you are interested in other scenarios of DVCLive with Evidently
integration, check
<a href="https://dvc.org/doc/user-guide/integrations/evidently" target="_blank" rel="nofollow noopener noreferrer">this integration example</a>.</p>
</blockquote>
<h3 id="versioning-the-reference-dataset-and-monitoring-reports" style="position:relative;">Versioning the Reference Dataset and Monitoring Reports<a href="#versioning-the-reference-dataset-and-monitoring-reports" aria-label="versioning the reference dataset and monitoring reports permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This example shows that DVC makes it easy to manage reference datasets for
monitoring purposes and to version the monitoring reports themselves.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/60b6965e5137e8d26a2495ad004864e5/39600/12-versioning.png" alt="Versioning reference datasets with DVC" title="Versioning reference datasets with DVC" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Versioning reference datasets with DVC</em></p>
<p>There are a few benefits of versioning reference datasets and monitoring
reports with DVC:</p>
<ul>
<li><strong>Registry of Reference Datasets:</strong> DVC helps store, version, and download
datasets for monitoring purposes. You may need to download the reference
dataset saved to cloud storage for a monitoring job in the production
environment. DVC makes life easier!</li>
<li><strong>Traceability</strong>: This practice ensures traceability, allowing you to link
model performance back to specific data versions.</li>
<li><strong>Version Control of Reports</strong>: You may want to manage all monitoring reports
with DVC. It ensures a historical record of your model's performance and data
quality.</li>
</ul>
<h2 id="-summing-up" style="position:relative;">🎨 Summing up<a href="#-summing-up" aria-label=" summing up permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The combination of DVC and Evidently in automating data and monitoring pipelines
offers a structured and efficient approach to ML model management. This setup
enhances the reproducibility and reliability of your ML workflows and provides a
clear framework for monitoring and improving your models over time. With this
setup, you're well-equipped to maintain high-quality ML models responsive to the
dynamic nature of real-world data.</p>
<p>However, this tutorial covers only a single approach for DVC and Evidently
integration. We are still working on other interesting scenarios and are looking for
community support! Stay tuned!</p>
<blockquote>
<p>💡 Did you find this tutorial interesting? Please, leave your comments and
share your experience with DVC and Evidently! Join us on
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a> 🙌</p>
</blockquote>
<h2 id="references" style="position:relative;">References<a href="#references" aria-label="references permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ul>
<li><a href="https://www.evidentlyai.com/blog/tutorial-1-model-analytics-in-production" target="_blank" rel="nofollow noopener noreferrer">How to break a model in 20 days. A tutorial on production model analytics</a></li>
<li><a href="https://iterative.ai/blog/turn-vs-code-into-ml-platform" target="_blank" rel="nofollow noopener noreferrer">Turn Your Favorite IDE into a Full Machine Learning Experimentation Platform</a></li>
</ul>https://dvc.org/blog/dvc-git-lfshttps://dvc.org/blog/dvc-git-lfsWed, 03 Jan 2024 00:00:00 GMT<p>One of the main features provided by DVC is the ability to <a href="https://dvc.org/doc/command-reference/import#example-importing-from-any-git-repository" title="dvc import" target="_blank" rel="nofollow noopener noreferrer">import</a> and
<a href="https://dvc.org/doc/command-reference/get#examples-get-a-misc-git-tracked-file" title="dvc get" target="_blank" rel="nofollow noopener noreferrer">download</a> files from any Git repository. In prior releases, this came with
the caveat that projects using <a href="https://git-lfs.com/" target="_blank" rel="nofollow noopener noreferrer">Git LFS</a> were
unsupported. As of version 3.31.0, DVC supports reading Git LFS objects, so
you can now <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> upstream datasets from platforms like
<a href="https://huggingface.co/" target="_blank" rel="nofollow noopener noreferrer">Hugging Face</a> which use Git LFS, without needing to
install any additional dependencies! Read on for an overview of how the DVC Git
LFS client was implemented.</p>
<p><em>To get started using DVC with Hugging Face, please refer to the DVC
integrations <a href="https://dvc.org/doc/user-guide/integrations/huggingface" title="DVC/Hugging Face Integration" target="_blank" rel="nofollow noopener noreferrer">documentation</a></em></p>
<p>DVC builds on top of Git's versioning capabilities using the open source
libraries <a href="https://www.dulwich.io/" target="_blank" rel="nofollow noopener noreferrer">Dulwich</a> and
<a href="https://www.pygit2.org/" target="_blank" rel="nofollow noopener noreferrer">pygit2</a> (which provides Python bindings for the C
library <a href="https://github.com/libgit2/libgit2" target="_blank" rel="nofollow noopener noreferrer">libgit2</a>). Using these libraries
provides DVC with access to Git functionality without requiring a traditional
command line Git installation, which can be particularly useful in containerized
environments. When integrating Git LFS support into DVC, we wanted
to keep the same approach, so DVC users could simply install DVC, and then
import and download files from any Git repository, regardless of whether or not
that repository uses Git LFS. Neither Dulwich nor libgit2/pygit2 support Git LFS
natively, but libgit2 does provide an API for the low level Git filters
functionality used by Git LFS. We have <a href="https://github.com/libgit2/pygit2/pull/1237" title="pygit2 filters pull request" target="_blank" rel="nofollow noopener noreferrer">contributed</a> to pygit2 so
that pygit2 users (like DVC) can now write libgit2 filters purely in Python,
without needing to use the lower level libgit2 C API.</p>
<p><em>DVC's Git client library (which wraps Dulwich and pygit2) is available
<a href="https://github.com/iterative/scmrepo" target="_blank" rel="nofollow noopener noreferrer">here</a></em></p>
<h2 id="git-filters-overview" style="position:relative;">Git filters overview<a href="#git-filters-overview" aria-label="git filters overview permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Git supports using attribute <a href="https://git-scm.com/docs/gitattributes#_filter" title="Git attributes filters" target="_blank" rel="nofollow noopener noreferrer">filters</a> to manipulate how objects are
stored internally in Git compared to how they are stored in your workspace. One
commonly used built-in filter is the CRLF filter, which will adjust line endings
in text files. The CRLF filter is typically used to ensure that files are
checked out into the workspace using the appropriate line endings for the user's
platform (linefeed on Unix and carriage return + linefeed on Windows), but are
only stored in Git with Unix-style line endings.</p>
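<p>As a rough model of what a line-ending filter does in each direction, consider the toy functions below. They are not Git's implementation, and the names <code>clean_line_endings</code> and <code>smudge_line_endings</code> are assumptions chosen for this illustration.</p>

```python
def clean_line_endings(workspace_bytes: bytes) -> bytes:
    """Workspace -> repository: normalize CRLF line endings to LF."""
    return workspace_bytes.replace(b"\r\n", b"\n")


def smudge_line_endings(stored_bytes: bytes, line_ending: bytes = b"\r\n") -> bytes:
    """Repository -> workspace: expand LF to the given platform line ending."""
    # Normalize first so already-converted input is not expanded twice.
    normalized = stored_bytes.replace(b"\r\n", b"\n")
    return normalized.replace(b"\n", line_ending)


stored = clean_line_endings(b"first\r\nsecond\r\n")
checked_out = smudge_line_endings(stored)
print(stored, checked_out)
```

<p>The stored form is always the same regardless of platform, while the checked-out form depends on the line ending passed in.</p>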
<p>Git LFS also works by using Git filters. When you add a file with the
<code>filter=lfs</code> attribute to Git, the Git LFS filter generates a "pointer" for Git
to store internally. The LFS pointer is a small text file containing a SHA256
LFS object ID for the original file. The Git LFS filter places the original file
in Git LFS storage, and then outputs the pointer to Git (instead of the original
file). Upon checkout, Git passes the pointer to the Git LFS filter, which then
reads the LFS object ID and checks out the appropriate original file into your
workspace.</p>
<div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version https://git-lfs.github.com/spec/v1
oid sha256:b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c
size 4</code></pre></div>
<p><em>Example Git LFS pointer</em></p>
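<p>A pointer like the one above is simple enough to build and parse with the standard library. The following is a minimal sketch, assuming only the three fields shown in the example; <code>build_pointer</code> and <code>parse_pointer</code> are hypothetical helpers and not part of scmrepo or Git LFS.</p>

```python
import hashlib
from typing import Dict, Tuple

POINTER_VERSION = "https://git-lfs.github.com/spec/v1"


def build_pointer(content: bytes) -> str:
    """Build pointer text (version, oid, size) for the given file content."""
    oid = hashlib.sha256(content).hexdigest()
    return f"version {POINTER_VERSION}\noid sha256:{oid}\nsize {len(content)}\n"


def parse_pointer(text: str) -> Tuple[str, int]:
    """Return (sha256 hex digest, size in bytes) from pointer text."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    if fields.get("version") != POINTER_VERSION:
        raise ValueError("not a Git LFS v1 pointer")
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algo}")
    return digest, int(fields["size"])


example = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c\n"
    "size 4\n"
)
oid, size = parse_pointer(example)
print(oid, size)
```

<p>The object ID is what a smudge step would use to locate the original file in LFS storage; the size lets a client sanity-check what it downloads.</p>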
<h2 id="libgit2-and-pygit2-filters" style="position:relative;">libgit2 and pygit2 filters<a href="#libgit2-and-pygit2-filters" aria-label="libgit2 and pygit2 filters permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>When saving objects in Git and when checking them back out to the workspace,
libgit2 runs a chain of registered filters. Each filter in the chain modifies
the object data as needed, and then passes the modified result into the next
filter. Writing a libgit2 filter in C is fairly complex: in addition to the filter
itself, it requires implementing multiple levels of callback structs for handling
the underlying buffered write streams. Our newly contributed support for Python
filters in pygit2 simplifies this. The low level
libgit2 APIs are abstracted away, and a subclassed <code>pygit2.Filter</code> implementation
only needs to implement three basic methods, <code>check()</code>, <code>write()</code> and <code>close()</code>.</p>
<ul>
<li><code>Filter.check()</code> is called prior to processing any object with Git attributes
which match the registered filter, and the filter can verify whether or not it
should be used with the given object, or indicate that the filter does not
need to be applied.</li>
<li><code>Filter.write()</code> is called one or more times and is used to “write” input data
chunks to the filter.</li>
<li><code>Filter.close()</code> is called after all of the input data has been written to the
filter.</li>
</ul>
<p>The filter can send output data chunks to the next filter in the chain as needed
via the <code>write_next()</code> callback.</p>
<p><em>Note: in Git, <code>smudge</code> filters are run when checking out objects from the Git
object database into the workspace, and <code>clean</code> filters are run when saving
objects from the workspace into the Git object database. In libgit2 and pygit2,
a single filter is registered which is used in both cases, and the direction is
indicated by the <code>mode</code> parameter.</em></p>
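<p>The <code>check()</code>, <code>write()</code>, <code>close()</code> lifecycle described above can be mimicked in plain Python. The toy below does not use pygit2, and its signatures are simplified assumptions (for instance, <code>check()</code> returns a boolean here); it only shows how a filter may buffer its input and emit output through <code>write_next</code> once everything has arrived.</p>

```python
from typing import Callable, List

WriteNext = Callable[[bytes], None]


class UppercaseFilter:
    """Toy filter that buffers all input and emits it upper-cased on close()."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def check(self, attr_values: List[str]) -> bool:
        # Apply only when the (hypothetical) attribute value is "upper".
        return bool(attr_values) and attr_values[0] == "upper"

    def write(self, data: bytes, write_next: WriteNext) -> None:
        # Buffer the chunk; nothing is forwarded until all input has arrived.
        self._buf.extend(data)

    def close(self, write_next: WriteNext) -> None:
        # All input is available now, so the result can be sent onward.
        write_next(bytes(self._buf).upper())


def run_filter(flt: UppercaseFilter, attr_values: List[str], chunks: List[bytes]) -> bytes:
    """Drive one filter in the order described: check, write (repeatedly), close."""
    if not flt.check(attr_values):
        return b"".join(chunks)  # filter not applicable: pass data through
    out: List[bytes] = []
    for chunk in chunks:
        flt.write(chunk, out.append)
    flt.close(out.append)
    return b"".join(out)


print(run_filter(UppercaseFilter(), ["upper"], [b"he", b"llo"]))
```

<p>A filter that can transform data chunk by chunk would call <code>write_next</code> inside <code>write()</code> instead; buffering until <code>close()</code> is only needed when the whole input must be seen first.</p>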
<h2 id="the-scmrepo-git-lfs-filter" style="position:relative;">The scmrepo Git LFS filter<a href="#the-scmrepo-git-lfs-filter" aria-label="the scmrepo git lfs filter permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Thanks to this higher level abstraction in pygit2, implementing the Git LFS
<code>smudge</code> filter in Python is straightforward:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">check</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> src<span class="token punctuation">:</span> <span class="token string">"FilterSource"</span><span class="token punctuation">,</span> attr_values<span class="token punctuation">:</span> List<span class="token punctuation">[</span><span class="token builtin">str</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">if</span> attr_values<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token operator">==</span> <span class="token string">"lfs"</span><span class="token punctuation">:</span>
        <span class="token keyword">if</span> src<span class="token punctuation">.</span>mode <span class="token operator">!=</span> GIT_FILTER_CLEAN<span class="token punctuation">:</span>
            self<span class="token punctuation">.</span>_smudge_buf <span class="token operator">=</span> io<span class="token punctuation">.</span>BytesIO<span class="token punctuation">(</span><span class="token punctuation">)</span>
            self<span class="token punctuation">.</span>_smudge_root <span class="token operator">=</span> src<span class="token punctuation">.</span>repo<span class="token punctuation">.</span>workdir <span class="token keyword">or</span> src<span class="token punctuation">.</span>repo<span class="token punctuation">.</span>path
        <span class="token keyword">return</span>
    <span class="token keyword">raise</span> Passthrough</code></pre></div>
<p>In <code>check()</code>, the first element in <code>attr_values</code> will contain the object’s
<code>filter</code> Git attribute. We verify that the object has <code>filter=lfs</code> set and that
we are in <code>smudge</code> mode (our filter is currently read-only and does not need to
implement <code>clean</code> mode). When in <code>smudge</code> mode we initialize an internal buffer
which will be used for reading the pointer data from Git, as well as storing the
original Git repository root path (which will be needed later).</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">write</span><span class="token punctuation">(</span>
    self<span class="token punctuation">,</span> data<span class="token punctuation">:</span> <span class="token builtin">bytes</span><span class="token punctuation">,</span> src<span class="token punctuation">:</span> <span class="token string">"FilterSource"</span><span class="token punctuation">,</span> write_next<span class="token punctuation">:</span> Callable<span class="token punctuation">[</span><span class="token punctuation">[</span><span class="token builtin">bytes</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token boolean">None</span><span class="token punctuation">]</span>
<span class="token punctuation">)</span><span class="token punctuation">:</span>
    …
    self<span class="token punctuation">.</span>_smudge_buf<span class="token punctuation">.</span>write<span class="token punctuation">(</span>data<span class="token punctuation">)</span></code></pre></div>
<p>In <code>write()</code> we append the input chunk to our buffer and then return. We do not
write to the next filter, since Git LFS <code>smudge</code> depends on reading the entire
pointer input before we can output any data.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">close</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> write_next<span class="token punctuation">:</span> Callable<span class="token punctuation">[</span><span class="token punctuation">[</span><span class="token builtin">bytes</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token boolean">None</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    …
    self<span class="token punctuation">.</span>_smudge<span class="token punctuation">(</span>write_next<span class="token punctuation">)</span>

<span class="token keyword">def</span> <span class="token function">_smudge</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> write_next<span class="token punctuation">:</span> Callable<span class="token punctuation">[</span><span class="token punctuation">[</span><span class="token builtin">bytes</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token boolean">None</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    …
    self<span class="token punctuation">.</span>_smudge_buf<span class="token punctuation">.</span>seek<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span>
    <span class="token keyword">with</span> Git<span class="token punctuation">(</span>self<span class="token punctuation">.</span>_smudge_root<span class="token punctuation">)</span> <span class="token keyword">as</span> scm<span class="token punctuation">:</span>
        <span class="token keyword">try</span><span class="token punctuation">:</span>
            url <span class="token operator">=</span> get_fetch_url<span class="token punctuation">(</span>scm<span class="token punctuation">)</span>
        <span class="token keyword">except</span> InvalidRemote<span class="token punctuation">:</span>
            url <span class="token operator">=</span> <span class="token boolean">None</span>
        fobj <span class="token operator">=</span> smudge<span class="token punctuation">(</span>scm<span class="token punctuation">.</span>lfs_storage<span class="token punctuation">,</span> self<span class="token punctuation">.</span>_smudge_buf<span class="token punctuation">,</span> url<span class="token operator">=</span>url<span class="token punctuation">)</span>
        data <span class="token operator">=</span> fobj<span class="token punctuation">.</span>read<span class="token punctuation">(</span>io<span class="token punctuation">.</span>DEFAULT_BUFFER_SIZE<span class="token punctuation">)</span>
        <span class="token keyword">try</span><span class="token punctuation">:</span>
            <span class="token keyword">while</span> data<span class="token punctuation">:</span>
                write_next<span class="token punctuation">(</span>data<span class="token punctuation">)</span>
                data <span class="token operator">=</span> fobj<span class="token punctuation">.</span>read<span class="token punctuation">(</span>io<span class="token punctuation">.</span>DEFAULT_BUFFER_SIZE<span class="token punctuation">)</span>
        <span class="token keyword">except</span> KeyboardInterrupt<span class="token punctuation">:</span>
            <span class="token keyword">return</span></code></pre></div>
<p><code>close()</code> delegates to <code>_smudge()</code>, which gets the configured Git LFS remote
URL (if one is set) and then runs our actual <code>smudge()</code> implementation. scmrepo’s
<code>smudge()</code> function returns a Python file-like object for the original file (and
not the internal pointer). We then just need to do a series of chunked reads and
writes to send the original file data to the next filter in the chain.</p>
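<p>The chunked read/write loop can be tried in isolation. The following is a minimal sketch, not scmrepo's code: <code>pump</code> is a hypothetical helper name, and a plain list stands in for the next filter in the chain.</p>

```python
import io
from typing import BinaryIO, Callable


def pump(
    fobj: BinaryIO,
    write_next: Callable[[bytes], None],
    chunk_size: int = io.DEFAULT_BUFFER_SIZE,
) -> int:
    """Copy fobj to write_next in fixed-size chunks and return the byte count."""
    total = 0
    # read() returns b"" at end of stream, which ends the loop.
    while chunk := fobj.read(chunk_size):
        write_next(chunk)
        total += len(chunk)
    return total


# Usage: collect the chunks in a list instead of passing them to a real filter.
received: list[bytes] = []
written = pump(io.BytesIO(b"x" * 20_000), received.append, chunk_size=8192)
print(written, [len(c) for c in received])
```

<p>Reading in bounded chunks keeps memory use flat even when the smudged file is many gigabytes.</p>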
<p>Since Git LFS <code>smudge</code> behavior is well defined by the <a href="https://github.com/git-lfs/git-lfs/blob/main/docs/spec.md#intercepting-git" title="Git LFS specification" target="_blank" rel="nofollow noopener noreferrer">Git LFS
specification</a> we will not go into a detailed explanation of our
Python implementation here. In short, <code>smudge()</code> verifies that the input data is
a valid Git LFS pointer, reads the Git LFS object ID from the pointer, and then
loads the appropriate object from Git LFS storage. If the specified object ID is
not available in the local Git LFS storage, it will be fetched from the remote
Git LFS server.</p>
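<p>To make those three steps concrete, here is a rough sketch under stated assumptions. It follows the pointer layout from the Git LFS specification (a <code>version</code> line, then <code>oid sha256:…</code> and <code>size</code>), but <code>parse_pointer</code>, <code>load_object</code>, the dict-based store, and the <code>fetch</code> callback are hypothetical names for illustration and are not scmrepo's API.</p>

```python
import hashlib
import re
from typing import Callable, NamedTuple

LFS_VERSION = b"version https://git-lfs.github.com/spec/v1"
OID_RE = re.compile(rb"^oid sha256:([0-9a-f]{64})$", re.MULTILINE)
SIZE_RE = re.compile(rb"^size (\d+)$", re.MULTILINE)


class Pointer(NamedTuple):
    oid: str
    size: int


def parse_pointer(data: bytes) -> Pointer:
    """Validate that data is a Git LFS pointer and extract its object ID and size."""
    if not data.startswith(LFS_VERSION + b"\n"):
        raise ValueError("not a Git LFS pointer")
    oid, size = OID_RE.search(data), SIZE_RE.search(data)
    if oid is None or size is None:
        raise ValueError("malformed Git LFS pointer")
    return Pointer(oid.group(1).decode(), int(size.group(1)))


def load_object(
    pointer: Pointer,
    local_store: dict[str, bytes],
    fetch: Callable[[str], bytes],
) -> bytes:
    """Return the object from local storage, fetching it first if it is missing."""
    if pointer.oid not in local_store:
        local_store[pointer.oid] = fetch(pointer.oid)
    return local_store[pointer.oid]


# Usage with an in-memory "remote" (illustrative stand-in for an LFS server).
content = b"hello lfs"
oid = hashlib.sha256(content).hexdigest()
pointer_bytes = LFS_VERSION + b"\noid sha256:" + oid.encode() + b"\nsize %d\n" % len(content)

fetched: list[str] = []


def fake_fetch(object_id: str) -> bytes:
    fetched.append(object_id)
    return content


store: dict[str, bytes] = {}
ptr = parse_pointer(pointer_bytes)
first = load_object(ptr, store, fake_fetch)   # fetched from the "remote"
second = load_object(ptr, store, fake_fetch)  # served from local storage
```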
<p><em>The complete source code for our scmrepo Git LFS filter is available on GitHub:
<a href="https://github.com/iterative/scmrepo/blob/main/src/scmrepo/git/backend/pygit2/filter.py" title="scmrepo filter.py" target="_blank" rel="nofollow noopener noreferrer">filter.py</a>, <a href="https://github.com/iterative/scmrepo/blob/main/src/scmrepo/git/lfs/smudge.py" title="scmrepo smudge.py" target="_blank" rel="nofollow noopener noreferrer">smudge.py</a></em></p>
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This recent update to DVC marks a significant milestone by eliminating the prior
limitation associated with Git LFS incompatibility. With version 3.31.0, DVC
users can seamlessly import files from Git repositories, including platforms
like Hugging Face, without needing extra dependencies. The integration of Git
LFS support, facilitated by the Dulwich and pygit2 libraries, streamlines
managing datasets and large objects in a Git repository.</p>
<p>This reinforces DVC's commitment to providing a versatile and user-friendly
open-source version control solution for diverse Git repositories.</p>https://dvc.org/blog/turn-vs-code-into-ml-platformhttps://dvc.org/blog/turn-vs-code-into-ml-platformThu, 16 Nov 2023 00:00:00 GMT<p><strong>Need an easy way to run and track your experiments?</strong> Install the DVC
extension from the
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">VS Code marketplace</a>.
Then, run experiments, visualize deep learning metrics in real-time, compare
experiments, and save the ones you like - all from your IDE.</p>
<p><img src="https://dvc.org/2023-11-16/run-python-file-f7dd9309e0f6abf350eac3c8a083cef2.gif" alt="Run Python file"><em>Run a
Python file and see results</em></p>
<p><strong>Want to simplify your chaotic ML iterations?</strong> With the DVC extension, you can
run reproducible workflows directly from VS Code.</p>
<p><img src="https://dvc.org/2023-11-16/modify-and-run-00de7b58ccfe3155924ed5da316ce9b8.gif" alt="Run a new experiment"><em>Run a
new experiment directly from VS Code</em></p>
<p>Live plots let you visualize metrics from these runs in real-time.</p>
<p><img src="https://dvc.org/2023-11-16/live-plots-cacfadaa43860e33db3ac9286fde2881.gif" alt="View plots in real-time"><em>View
plots in real-time</em></p>
<p>To make it easy for you to create the workflows, the extension even
auto-generates code snippets.</p>
<p><img src="https://dvc.org/2023-11-16/auto-generate-code-fffffd810d133db3cc2bd2b08aa2cb99.gif" alt="Auto-generate pipeline specifications"><em>Auto-generate
pipeline specifications</em></p>
<p><strong>Tired of context switching throughout the day?</strong> The integration of DVC with
VS Code empowers you to do everything from within your IDE. No more jumping from
notebooks to the terminal to IDE to web browsers to Git.</p>
<h1 id="why-a-dvc-extension-for-vs-code" style="position:relative;">Why a DVC extension for VS Code?<a href="#why-a-dvc-extension-for-vs-code" aria-label="why a dvc extension for vs code permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a> has helped individual ML developers and teams in
companies like UBS, DeGould, Exscientia, Kibsi and many more to standardize
their ML workflows on top of their cloud resources and Git repositories.</p>
<p>Visual Studio Code (VS Code) is, by far, the most popular IDE for all
developers, including ML engineers.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ffd03af5b25e2fc25799a4bf5a38f6e7/39600/so-survey.png" alt="StackOverflow survey" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Source:
<a href="https://survey.stackoverflow.co/2022/#section-most-popular-technologies-integrated-development-environment" target="_blank" rel="nofollow noopener noreferrer">StackOverflow survey 2022</a></em></p>
<p>The DVC extension makes VS Code even more useful for you by providing you a VS
Code-native environment for managing your ML projects. You get the power of DVC
with capabilities beyond what's available in the terminal!</p>
<p>With over 34 thousand installs, the extension is proven to help you solve the
challenges of creating and managing your Machine Learning workflows.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d7618fc8b5efaf80fc98f5fa6a4767ad/39600/dvc-extension-in-vs-code-marketplace.png" alt="DVC extension in the VS Code marketplace" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>DVC
extension in the VS Code marketplace</em></p>
<h1 id="getting-started-with-the-dvc-extension-for-vs-code" style="position:relative;">Getting started with the DVC extension for VS Code<a href="#getting-started-with-the-dvc-extension-for-vs-code" aria-label="getting started with the dvc extension for vs code permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>To install the extension, open VS Code and search for "DVC" in the Extensions
view. Or install the extension from the
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">VS Code marketplace</a>.</p>
<p>Now, create a DVC repository for your machine learning project and start
experimenting! Here’s how to do this:</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/6KtIRVfr61E?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>To see all the DVC commands supported by the extension, open the Command
Palette (F1, or ⇧⌃P on Windows/Linux, or ⇧⌘P on macOS) and type DVC.</p>
<h1 id="its-always-getting-better" style="position:relative;">It’s always getting better!<a href="#its-always-getting-better" aria-label="its always getting better permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Over the last year, we’ve made several enhancements to the DVC extension for
VS Code. For some of the interesting things you can do with it, watch the videos
<a href="https://www.youtube.com/watch?v=VMYggTLm_-U&list=PL7WG7YrwYcnBo3ZBapzKNxtBcfNjGDQMM&index=5" target="_blank" rel="nofollow noopener noreferrer">here</a>.
As a mark of the extension reaching a new level of maturity, today we have
launched it on
<a href="https://www.producthunt.com/posts/dvc-extension-for-vs-code" target="_blank" rel="nofollow noopener noreferrer">Product Hunt</a>. It
would be awesome if you checked it out and left us some feedback and support!</p>
<p>We are excited to see how the DVC VS Code extension helps you simplify your ML
workflows. For more information:</p>
<ul>
<li>DVC extension in the VS Code marketplace:
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">https://marketplace.visualstudio.com/items?itemName=Iterative.dvc</a></li>
<li>GitHub repository: <a href="https://github.com/iterative/vscode-dvc" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/vscode-dvc</a></li>
<li>DVC documentation: <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">https://dvc.org/</a></li>
<li>DVC community forum: <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">https://dvc.org/chat</a></li>
</ul>https://dvc.org/blog/leveraging-llms-in-chatbots-the-dvc-approachhttps://dvc.org/blog/leveraging-llms-in-chatbots-the-dvc-approachMon, 25 Sep 2023 00:00:00 GMT<p>In the modern world of Machine Learning (ML) and Natural Language Processing
(NLP), there's been a surge in applications built on top of Large Language
Models (LLMs). Adoption has grown almost exponentially, with companies building
LLM applications across a variety of areas.</p>
<p>In this post we will show how DVC can make designing LLM applications more
efficient and organized. We take a Retrieval-Augmented Generation
(<a href="https://artificialcorner.com/retrieval-augmented-generation-rag-a-short-introduction-21d0044d65ff" target="_blank" rel="nofollow noopener noreferrer">RAG</a>)
approach and illustrate how we can break down the various phases of a RAG
chatbot and version them with DVC. We can use DVC to both "time travel" and
avoid the need to re-compute stages unnecessarily with little extra effort.</p>
<h2 id="the-rise-of-chatbots-in-technical-advice" style="position:relative;">The Rise of Chatbots in Technical Advice<a href="#the-rise-of-chatbots-in-technical-advice" aria-label="the rise of chatbots in technical advice permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Chatbots are finding a natural fit in providing technical advice. For our
product, DVC, which has amassed significant popularity, we've introduced a
chatbot designed to streamline user experience. Our bot sources information not
just from our official documentation but also from our community discussions on
Discord. This creates a broader knowledge base than using our official
documentation alone, and ensures a balanced mix of official guidelines and
community insights.</p>
<h2 id="the-rag-approach" style="position:relative;">The RAG Approach<a href="#the-rag-approach" aria-label="the rag approach permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Our chatbot uses the Retrieval-Augmented Generation (RAG) approach. The
<a href="https://towardsdatascience.com/rag-vs-finetuning-which-is-the-best-tool-to-boost-your-llm-application-94654b1eaba7" target="_blank" rel="nofollow noopener noreferrer">debate</a>
between the efficacy of RAG vs. fine-tuning methods is ongoing and lively.
However, our choice leans towards RAG due to its simplicity and relative
computational efficiency for quickly iterating on different approaches.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bf4e94f792cc00d8e6f80269a1a3e8bd/39600/flowchart.png" alt="RAG flowchart" title="RAG flowchart" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Illustration
of the RAG approach: First we build a vector database with chunks of text. After
we retrieve chunks relevant to the user query from the vector database, we
insert those chunks into the prompt to give the LLM context.</em></p>
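<p>The two steps in the illustration can be sketched in a few lines. This is a toy stand-in, not our chatbot's code: word counts and cosine similarity replace real embeddings and a vector database, and the names <code>embed</code>, <code>retrieve</code>, and <code>build_prompt</code> are hypothetical.</p>

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Insert the retrieved chunks into the prompt as context for the LLM."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Step 1: build the "vector database" from chunks of text.
chunks = [
    "dvc repro reruns only the stages whose dependencies changed.",
    "dvc pull downloads tracked data from remote storage.",
    "The VS Code extension shows live plots.",
]
index = [(c, embed(c)) for c in chunks]

# Step 2: retrieve relevant chunks and put them into the prompt.
query = "How do I download data from remote storage?"
top = retrieve(query, index, k=1)
prompt = build_prompt(query, top)
print(prompt)
```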
<h2 id="citation-a-key-differentiator" style="position:relative;">Citation: A Key Differentiator<a href="#citation-a-key-differentiator" aria-label="citation a key differentiator permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>A common complaint about chatbots is that they do not cite any sources, which
leaves users with few avenues to validate the information provided by the
chatbot.</p>
<p><img src="https://dvc.org/2023-09-25/chat_bot_gif-c5be4d288070c6cc9d0a913486e15dc1.gif" alt="Chatbot in action video" title="=800"><em>Demo
of our chatbot</em></p>
<p>Our chatbot is able to cite the sources of its answers. It does this using the
LangChain
<a href="https://api.python.langchain.com/en/latest/chains/langchain.chains.qa_with_sources.retrieval.RetrievalQAWithSourcesChain.html" target="_blank" rel="nofollow noopener noreferrer">RetrievalQAWithSourcesChain</a>.
This is a key feature for many users.</p>
<h2 id="building-the-chatbot-using-dvc" style="position:relative;">Building the Chatbot Using DVC<a href="#building-the-chatbot-using-dvc" aria-label="building the chatbot using dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Our chatbot builds on top of the
<a href="https://github.com/hwchase17/notion-qa" target="_blank" rel="nofollow noopener noreferrer">LangChain Notion Question-Answering</a>
example using DVC to manage the pipeline. Interestingly, while we built a
chatbot for DVC, we also employed DVC in its construction. This seemingly
circular approach allowed us to leverage the standard benefits that DVC offers:</p>
<ol>
<li><strong>Rollback Facility</strong>: The ability to revert to previous versions is
invaluable, especially when dealing with unpredictable outputs in response to
varying prompts.</li>
<li><strong>Efficiency</strong>: DVC prevents redundant computation when updating specific
phases, saving both time and computational resources.</li>
<li><strong>Visual Representation with DVC DAG</strong>: The Directed Acyclic Graph (DAG)
provided by DVC visualizes how the chatbot's construction is broken down into
distinct stages, aiding understanding and development.</li>
</ol>
<div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">+----------------------+
| discord_dump.zip.dvc |
+----------------------+
+-------------------+
| docs_dump.zip.dvc |
+-------------------+
*
*
*
+--------+
| expand |
+--------+
*
*
*
+--------+
| ingest |
+--------+
*
*
*
+-----------+ +-----------------+
| vectorize | | samples.txt.dvc |
+-----------+ +-----------------+
*** ***
* *
** **
+-----+
| run |
+-----+</code></pre></div>
<p>The bot is built in a few standard phases for RAG:</p>
<ol>
<li><code>expand</code>: unzip archives of documents</li>
<li><code>ingest</code>: Chunk the text of the documents into small pieces
that we can embed and also put into prompts for the chatbot. The standard
<a href="https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter" target="_blank" rel="nofollow noopener noreferrer">text splitters</a>
make sense for documentation pages, but a dump of two years' worth of Discord
chats requires a custom splitter.</li>
<li><code>vectorize</code>: Build a
<a href="https://github.com/facebookresearch/faiss" target="_blank" rel="nofollow noopener noreferrer">vector database</a> with embeddings
of all the text chunks</li>
<li><code>run</code>: Extract the relevant text chunks for the sample questions, put them
into prompts, and call the LLM</li>
</ol>
<p>DVC allows us to keep the outputs from each stage under version control, and
manage the parameterization, with little extra effort. This provides the
advantage that if we choose to update the vectorize stage, we can reuse the
outputs of the ingest stage without re-running it. Or, if we want to roll back
to an old version of vectorize, we can get that intermediate output back without
re-running it and without the high chance of making a mistake in versioning if
we try to do that manually.</p>
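<p>The idea behind this reuse can be illustrated with a small sketch. It is a simplified model and not DVC's implementation: a stage's output is stored under a fingerprint of its dependencies and parameters, so an unchanged or previously seen combination is returned without re-running. The names <code>fingerprint</code> and <code>StageCache</code>, and the parameter values, are hypothetical.</p>

```python
import hashlib
import json
from typing import Callable


def fingerprint(deps: dict[str, bytes], params: dict) -> str:
    """Hash the stage's dependency contents together with its parameters."""
    h = hashlib.sha256()
    for name in sorted(deps):
        h.update(name.encode())
        h.update(hashlib.sha256(deps[name]).digest())
    h.update(json.dumps(params, sort_keys=True).encode())
    return h.hexdigest()


class StageCache:
    """Keep every output ever produced, keyed by its input fingerprint."""

    def __init__(self) -> None:
        self._outputs: dict[str, str] = {}
        self.runs = 0

    def run(self, deps: dict[str, bytes], params: dict,
            fn: Callable[[dict, dict], str]) -> str:
        key = fingerprint(deps, params)
        if key not in self._outputs:  # only compute for unseen inputs
            self.runs += 1
            self._outputs[key] = fn(deps, params)
        return self._outputs[key]


def vectorize(deps: dict, params: dict) -> str:
    # Stand-in for an expensive stage that calls a paid API.
    return f"index(ctx={params['embedding_ctx_length']})"


cache = StageCache()
chunks = {"chunks.json": b"[...]"}  # placeholder for the ingest stage's output
old = cache.run(chunks, {"embedding_ctx_length": 512}, vectorize)
new = cache.run(chunks, {"embedding_ctx_length": 256}, vectorize)
rolled_back = cache.run(chunks, {"embedding_ctx_length": 512}, vectorize)
print(cache.runs)  # the rollback did not trigger a third run
```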
<p>Both the vectorize and run stages use the OpenAI API. So, repeated computation
not only costs time but also actual dollars.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 616px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b381d3c919972544273b5ac34e7be75a/0e253/docs_text_chunking.png" alt="Text chunking the official docs" title="Text chunking the official docs" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>We
apply a standard text chunker to the markdown for our official documentation. It
contains a few options for chunk size and desired overlap between chunks. DVC
helps organize these parameters.</em></p>
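<p>As a rough sketch of what chunk size and overlap mean, the following character-based splitter is a deliberately simple stand-in. It is not the LangChain splitter used in the pipeline, and <code>chunk_text</code> and its parameter names are assumptions made for illustration.</p>

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

<p>In a DVC pipeline, values such as these would typically live in <code>params.yaml</code>, so changing one of them invalidates only the stages that depend on it.</p>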
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 504px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4e25c4bf35758804f5933504eadd3a9d/0dcb2/discord_text_chunking.png" alt="Text chunking the public discord" title="Text chunking the public discord" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>For
our Discord, we group together successive messages from the same author and then
start a chunk at each message. The author and datetime can be formatted in
various ways when they are put into the prompts in the later stages. Experimenting
with these options is easier when you have DVC.</em></p>
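<p>A minimal sketch of that grouping, assuming each message carries an author, a timestamp, and text. The <code>Message</code> type, the window size, and the prompt formatting are all assumptions for illustration; the custom splitter in the demo repository may differ.</p>

```python
from itertools import groupby
from typing import NamedTuple


class Message(NamedTuple):
    author: str
    timestamp: str
    text: str


def merge_by_author(messages: list[Message]) -> list[Message]:
    """Merge each run of successive messages from the same author into one."""
    merged = []
    for author, run in groupby(messages, key=lambda m: m.author):
        run = list(run)
        merged.append(
            Message(author, run[0].timestamp, " ".join(m.text for m in run))
        )
    return merged


def chunks_from(messages: list[Message], window: int = 2) -> list[str]:
    """Start one chunk at every merged message, spanning up to `window` messages."""
    merged = merge_by_author(messages)
    return [
        "\n".join(f"[{m.timestamp}] {m.author}: {m.text}" for m in merged[i:i + window])
        for i in range(len(merged))
    ]


log = [
    Message("alice", "10:00", "Is the cache shared?"),
    Message("alice", "10:01", "Asking for a CI setup."),
    Message("bob", "10:05", "It can be."),
    Message("alice", "10:06", "Thanks!"),
]
discord_chunks = chunks_from(log, window=2)
print(discord_chunks[0])
```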
<h2 id="the-importance-of-rollback" style="position:relative;">The Importance of Rollback<a href="#the-importance-of-rollback" aria-label="the importance of rollback permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Changes in chatbot prompts can have unforeseen consequences. In some cases, they
might improve the bot's performance, while in others, they might lead to
degradation. Given the computational cost of re-running phases and the
unpredictable nature of such changes, rollback doesn't merely refer to reverting
to old code. It also allows reverting to older intermediate outputs, making the
development process much more computationally efficient and organized.</p>
<h2 id="incorporating-the-discord-community-insights" style="position:relative;">Incorporating the Discord Community Insights<a href="#incorporating-the-discord-community-insights" aria-label="incorporating the discord community insights permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>One significant factor affecting the performance of our chatbot is the manner in
which we segment and integrate text from our Discord channel. Different
text-splitting techniques can lead to variance in performance, highlighting the
importance of continually refining this integration process. Furthermore,
providing useful meta information for sources in Discord can be done in various
ways. Again, DVC handles the bookkeeping of iterating on these approaches
without re-running unchanged stages.</p>
<h2 id="running-it-yourself" style="position:relative;">Running it Yourself<a href="#running-it-yourself" aria-label="running it yourself permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>First clone the git repository <a href="https://github.com/iterative/llm-demo" target="_blank" rel="nofollow noopener noreferrer">here</a>.
Once you have an <a href="https://platform.openai.com/account/api-keys" target="_blank" rel="nofollow noopener noreferrer">OpenAI API key</a>,
you can easily get the project going with <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>. Re-running the demo from
scratch costs about $0.40 USD in credits.</p>
<p>First, clone the code:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token git">git clone</span> git@github.com:iterative/llm-demo.git
</span><span class="token line"><span class="token input">$ </span><span class="token command">cd</span> llm-demo</span></code></pre></div>
<p>The training run is all logged in DVC in an S3 store. So, if you are already
authenticated on AWS you can get all the data with:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span></span></code></pre></div>
<p>To set your environment up to run the code, first install all requirements in a
virtual env:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">virtualenv</span> <span class="token function">env</span> <span class="token parameter variable">--python</span><span class="token operator">=</span>python3
</span><span class="token line"><span class="token input">$ </span><span class="token command">source</span> env/bin/activate
</span><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> <span class="token parameter variable">-r</span> requirements.txt</span></code></pre></div>
<p>Then set your OpenAI API key (if you don't have one, get one
<a href="https://beta.openai.com/playground" target="_blank" rel="nofollow noopener noreferrer">here</a>):</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">export</span> <span class="token assign-left variable">OPENAI_API_KEY</span><span class="token operator">=</span><span class="token punctuation">..</span>.</span></code></pre></div>
<p>Typing a space before the command prevents the API key from staying in your
bash history if that is
<a href="https://stackoverflow.com/questions/6475524/how-do-i-prevent-commands-from-showing-up-in-bash-history" target="_blank" rel="nofollow noopener noreferrer">configured</a>.</p>
<p>Now you should be ready to re-run the pipeline. Assuming you have not
changed anything, nothing should need to run. Everything can be reused from the
<code>dvc pull</code>:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span></span></code></pre></div>
<p>Now you can start up the web UI using:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">streamlit</span> run main.py</span></code></pre></div>
<p>The command should open the bot in your web browser. The log of interactions can
be found in <code>chat.log</code>.</p>
<h2 id="example-of-using-dvc-rollback" style="position:relative;">Example of using DVC rollback<a href="#example-of-using-dvc-rollback" aria-label="example of using dvc rollback permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Let's take a concrete example illustrating how we can use DVC in the bot
development. Suppose we want to adjust the embedding <code>embedding_ctx_length</code>
because we think it can help us save some cost on API calls and lower the
interactive latency. To do this in a reproducible way, we first make a Git
branch for the change:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> <span class="token parameter variable">-b</span> try_new_embed</span></code></pre></div>
<p>Now if we re-run the pipeline with DVC we will notice that it skips re-running
the expand and ingest phases because nothing has changed for their dependencies:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token string">'OpenAIEmbeddings.embedding_ctx_length=256'</span>
</span>'samples.txt.dvc' didn't change, skipping
Stage 'setup' didn't change, skipping
'docs_dump.zip.dvc' didn't change, skipping
Stage 'expand' didn't change, skipping
Stage 'ingest' didn't change, skipping
Running stage 'vectorize':
<span class="token line"><span class="token input">$ </span><span class="token command">python</span> vector_store.py
</span>...</code></pre></div>
<p>We can also version the outputs with DVC:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> dvc.lock params.yaml
</span><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span> <span class="token parameter variable">-m</span> <span class="token string">"new embed model"</span></span></code></pre></div>
<p>We can try out the new settings with:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">streamlit</span> run main.py</span></code></pre></div>
<p>However, if despite any cost savings we don't like the results with these new
settings, we can easily revert to the old pipeline using Git and DVC:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> master
</span>Switched to branch 'master'
Your branch is up to date with 'origin/master'.
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc checkout</span>
</span>M faiss_store.pkl
M docs.index
M results.csv
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span>
</span>'samples.txt.dvc' didn't change, skipping
Stage 'setup' didn't change, skipping
'docs_dump.zip.dvc' didn't change, skipping
Stage 'expand' didn't change, skipping
Stage 'ingest' didn't change, skipping
Stage 'vectorize' didn't change, skipping
Stage 'run' didn't change, skipping
Data and pipelines are up to date.</code></pre></div>
<p>DVC does not need to rerun any stage because it has saved all the old outputs
from the master branch. Likewise, we can always switch back to the experimental
setup with:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> try_new_embed
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc checkout</span></span></code></pre></div>
<p>Using these few commands, we can use DVC to both "time travel" and avoid the
need to re-compute stages unnecessarily with little extra effort.</p>
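<p>To make the skipping behaviour more concrete, here is a minimal sketch of the general idea: a stage is re-run only when the hash of its dependencies and parameters differs from the one recorded earlier. This is a simplified illustration, not DVC's actual implementation; the helper names (<code>stage_signature</code>, <code>needs_rerun</code>) and the in-memory <code>lock</code> dictionary are assumptions made for this example.</p>

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return a content hash for one dependency file."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def stage_signature(deps: list[Path], params: dict) -> str:
    """Combine dependency hashes and parameter values into one signature."""
    payload = {
        "deps": {str(p): file_hash(p) for p in deps},
        "params": params,
    }
    return hashlib.md5(json.dumps(payload, sort_keys=True).encode()).hexdigest()


def needs_rerun(stage: str, deps: list[Path], params: dict, lock: dict) -> bool:
    """Skip the stage when its signature matches the recorded one."""
    signature = stage_signature(deps, params)
    if lock.get(stage) == signature:
        return False
    lock[stage] = signature  # record the new state, like a lock file would
    return True


with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp) / "docs.txt"
    docs.write_text("some documents")
    lock = {}  # hypothetical stand-in for a lock file
    first = needs_rerun("vectorize", [docs], {"embedding_ctx_length": 512}, lock)
    second = needs_rerun("vectorize", [docs], {"embedding_ctx_length": 512}, lock)
    third = needs_rerun("vectorize", [docs], {"embedding_ctx_length": 256}, lock)
    print(first, second, third)
```

<p>In this toy version, the second call is skipped because nothing changed, while changing <code>embedding_ctx_length</code> triggers a re-run of the stage that depends on it.</p>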
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The benefits of using DVC are shared across most LLM applications. Whether you
are working with a Discord, Slack, or Google Docs corpus, with RAG or
fine-tuning, using DVC to manage your pipeline brings similar benefits: it
streamlines development and makes experiments reproducible. Given how much
most LLM applications have in common, it's safe to conclude that they could
benefit immensely from
incorporating DVC in their workflows.</p>https://dvc.org/blog/finetune-llm-pipeline-dvc-skypilothttps://dvc.org/blog/finetune-llm-pipeline-dvc-skypilotFri, 08 Sep 2023 00:00:00 GMT<h2 id="introduction---solving-cloud-resources-and-reproducibility-for-llms" style="position:relative;">Introduction - Solving cloud resources and reproducibility for LLMs<a href="#introduction---solving-cloud-resources-and-reproducibility-for-llms" aria-label="introduction solving cloud resources and reproducibility for llms permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>A few weeks ago, I wrote a
<a href="https://alex000kim.com/tech/2023-08-10-ml-experiments-in-cloud-skypilot-dvc/" target="_blank" rel="nofollow noopener noreferrer">post</a>
about the challenges of training large ML models, in particular:</p>
<ol>
<li>the need for more computing power and the complexity of managing cloud
resources;</li>
<li>the difficulty of keeping track of ML experiments and reproducing results.</li>
</ol>
<p>There I proposed a solution to these problems by using
<a href="https://skypilot.readthedocs.io/en/latest/" target="_blank" rel="nofollow noopener noreferrer">SkyPilot</a> and
<a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a> to manage cloud resources and track experiments,
respectively.</p>
<p>These problems are especially relevant for large language models, where both the
model size and the amount of data required for training are <em>very</em> large. In
this blog post, I will walk you through an end-to-end production-grade Machine
Learning pipeline for performing Supervised Fine-Tuning (SFT) of large language
models (LLMs) on conversational data. This project demonstrates the effective
use of technologies like <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">DVC</a>,
<a href="https://github.com/skypilot-org/skypilot" target="_blank" rel="nofollow noopener noreferrer">SkyPilot</a>, HuggingFace
<a href="https://github.com/huggingface/transformers" target="_blank" rel="nofollow noopener noreferrer">Transformers</a>,
<a href="https://github.com/huggingface/peft" target="_blank" rel="nofollow noopener noreferrer">PEFT</a>,
<a href="https://github.com/huggingface/trl" target="_blank" rel="nofollow noopener noreferrer">TRL</a> and others.</p>
<p>All the code for this project is available on GitHub:</p>
<p><a href="https://github.com/alex000kim/ML-Pipeline-With-DVC-SkyPilot-HuggingFace" target="_blank" rel="nofollow noopener noreferrer">https://github.com/alex000kim/ML-Pipeline-With-DVC-SkyPilot-HuggingFace</a></p>
<h3 id="whats-fine-tuning-and-when-to-use-it" style="position:relative;">What’s Fine-Tuning and When to Use It<a href="#whats-fine-tuning-and-when-to-use-it" aria-label="whats fine tuning and when to use it permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Let’s recap the differences between prompt engineering, prompt tuning, and model
fine-tuning, three distinct approaches to working with LLMs.</p>
<p>Feel free to skip this section if you’re already familiar with these concepts.</p>
<details>
<summary> Prompt engineering, prompt tuning, and model fine-tuning </summary>
<p>Prompt engineering, prompt tuning, and model fine-tuning are three
techniques for adapting large language models to downstream tasks. Prompt
engineering relies on skillfully designing input prompts, often with demo
examples, to steer model behavior without any parameter changes. Prompt
tuning takes a more automated approach - learning continuous token
embeddings as tunable prompts appended to the input. This keeps the base
model frozen but allows the prompts to be optimized. Finally, model
fine-tuning adapts all (or a subset) of the model’s parameters directly through continued
training on downstream data. While fine-tuning can achieve strong
performance, prompt engineering and tuning offer greater parameter
efficiency and model reuse. However, prompt methods may require more
iteration and heuristics to work well.</p>
<p>Fine-tuning gives the model maximal flexibility to adapt its entire set (or a
subset) of parameters directly on the new data. This end-to-end training
approach is especially powerful when the target task or domain differs
significantly from the original pre-training data. In such cases, extensive
adaptation of the model may be required beyond what is possible through the
model’s fixed input representations alone. However, fine-tuning requires
re-training large models which can be computationally expensive. It also loses
the ability to efficiently share one model across multiple tasks. Overall,
fine-tuning tends to be preferred when maximum task performance is critical and
training resources are available.</p>
<p>Below is a table comparing these techniques:</p>
<table><thead><tr><th>Method</th><th>Description</th><th>Advantages</th><th>Disadvantages</th></tr></thead><tbody><tr><td>Prompt Engineering</td><td>Skillfully designing input prompts, often with demo examples, to steer model behavior without parameter changes</td><td>• Efficient parameter reuse <br> • No model re-training needed</td><td>• Can require much iteration and tuning <br> • Limited flexibility to adapt model</td></tr><tr><td>Prompt Tuning</td><td>Learning continuous token embeddings as tunable prompts appended to the input while the base model stays frozen</td><td>• Efficient parameter reuse <br> • Automated prompt optimization</td><td>• Less flexible than fine-tuning <br> • Still some manual effort needed</td></tr><tr><td>Model Fine-tuning</td><td>Adapting all (or a subset) of the model’s parameters through continued training on new data</td><td>• Allows significant adaptation to new tasks/data <br> • Can achieve very strong performance</td><td>• Can be difficult to set up <br> • Computationally expensive <br> • Loses ability to share model across tasks</td></tr></tbody></table>
</details>
<h2 id="overview-of-the-project" style="position:relative;">Overview of the Project<a href="#overview-of-the-project" aria-label="overview of the project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The project leverages several technologies:</p>
<ol>
<li><strong><a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a></strong> for reproducible ML pipelines: This tool enables
us to define the ML workflow as a Directed Acyclic Graph (DAG) of pipeline
stages, with dependencies between data, models, and metrics automatically
tracked. It also integrates with remote storage like S3 to efficiently
version large datasets and model files.</li>
<li><strong><a href="https://skypilot.readthedocs.io/en/latest/" target="_blank" rel="nofollow noopener noreferrer">SkyPilot</a></strong> for scalable cloud
infrastructure: SkyPilot simplifies the process of launching cloud compute
resources on demand for development or distributed training. It supports spot
instances to reduce training costs and permits the quick set up of remote
interactive development environments.</li>
<li><strong><a href="https://huggingface.co/" target="_blank" rel="nofollow noopener noreferrer">HuggingFace</a></strong> and other libraries for efficient
training of quantized models: HuggingFace Transformers provides a simple API
for training and fine-tuning large transformer models. In combination with
bitsandbytes, it enables reduced-precision and quantization-aware training
for greater efficiency.</li>
</ol>
<p>The <a href="https://github.com/artidoro/qlora" target="_blank" rel="nofollow noopener noreferrer">QLoRA</a> quantization technique will allow
us to apply 4-bit quantization to the model weights. For the Llama 7b model, this
reduces GPU memory requirements from ~98 GB (with float32 precision) down to ~12
GB (with int4 precision). The screenshot below is from a handy
<a href="https://huggingface.co/spaces/hf-accelerate/model-memory-usage" target="_blank" rel="nofollow noopener noreferrer">Model Memory Calculator</a>
that helps you estimate how much vRAM is needed to train a model hosted on the
Hugging Face Hub.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c6afbd26db7f41a6f93eae5d877b45b1/39600/gpu_memory_requirements.png" alt="GPU memory requirements" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
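<p>As a rough cross-check of those numbers, the sketch below estimates memory from the parameter count and the bytes per parameter. It rests on two assumptions that do not come from the project itself: a parameter count of about 6.74 billion for the 7b model, and the rule of thumb that training with Adam needs roughly four times the memory of the weights alone. The results should only be expected to land in the same ballpark as the calculator.</p>

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}
GIB = 1024 ** 3


def weights_gib(num_params: float, dtype: str) -> float:
    """Memory needed just to hold the model weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


def training_gib(num_params: float, dtype: str, multiplier: float = 4.0) -> float:
    """Rough training footprint; the 4x multiplier is an assumed rule of thumb."""
    return weights_gib(num_params, dtype) * multiplier


LLAMA_7B_PARAMS = 6.74e9  # approximate parameter count (assumption)

for dtype in ("float32", "int4"):
    print(
        f"{dtype}: weights ~{weights_gib(LLAMA_7B_PARAMS, dtype):.1f} GiB, "
        f"training ~{training_gib(LLAMA_7B_PARAMS, dtype):.1f} GiB"
    )
```

<p>Going from float32 to int4 divides the bytes per parameter by eight, which is where most of the savings come from.</p>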
<p>Considering the GPU memory overhead due to optimizer states, gradients, and
forward activations, we’d need around 16 GB of vRAM to fine-tune a 4-bit quantized
7b model. The NVIDIA A10G is a good candidate for this
(<a href="https://aws.amazon.com/ec2/instance-types/g5/" target="_blank" rel="nofollow noopener noreferrer"><code>g5.2xlarge</code></a> instance on AWS)
as it costs a little over $1 per hour for on-demand pricing or $0.35 per hour
for spot instance pricing.</p>
<p>The total training time will depend on the size of your dataset and the number
of epochs you want to train for. But with this setup, I believe it's possible to
train a model to achieve decent (better than the base pretrained model)
performance on some narrow task for under $50 total.</p>
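<p>To put that budget in perspective, the short calculation below converts it into GPU hours. The on-demand rate is rounded down to $1.00 per hour for simplicity (an assumption, since the actual price is a little higher), and the spot rate is the $0.35 per hour quoted above; real prices vary by region and over time.</p>

```python
def affordable_hours(budget_usd: float, hourly_rate_usd: float) -> float:
    """Number of instance hours a budget covers at a fixed hourly rate."""
    return budget_usd / hourly_rate_usd


BUDGET_USD = 50.0
# Rounded, illustrative rates; check current pricing before relying on them.
RATES_USD_PER_HOUR = {"on_demand": 1.00, "spot": 0.35}

for pricing, rate in RATES_USD_PER_HOUR.items():
    hours = affordable_hours(BUDGET_USD, rate)
    print(f"{pricing}: about {hours:.0f} hours for ${BUDGET_USD:.0f}")
```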
<p>For comparison, if you were fine-tuning the same model but with float16
precision, you’d need one or more NVIDIA A100 (80GB version) or H100 GPUs.
Currently, they are almost impossible to get access to due to the high demand
(unless you work at one of the
<a href="https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini" target="_blank" rel="nofollow noopener noreferrer">“GPU-rich” companies</a>).
This kind of cloud hardware can be 5-10 times more expensive. For example,
according to this
<a href="https://blog.skypilot.co/finetuning-llama2-operational-guide/" target="_blank" rel="nofollow noopener noreferrer">post</a>, it would
cost you a little over $300 to fine-tune a non-quantized 7b Llama 2 model on the
<a href="https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered" target="_blank" rel="nofollow noopener noreferrer">ShareGPT</a>
dataset for 3 epochs.</p>
<p>The price, of course, isn’t the only important factor. There are other low-cost
Jupyter-based development environments like Google Colab or Kaggle Notebooks.
While a Jupyter environment is convenient when developing prototypes, the key
advantage of the everything-as-code (EaC) approach proposed here is centralizing
your code, datasets, hyperparameters, model weights, training infrastructure and
development environment in a git repository. With LLMs being notoriously
unpredictable, maintaining tight version control over training is critical.</p>
<h3 id="setup" style="position:relative;">Setup<a href="#setup" aria-label="setup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>To begin, clone the project repository. Then, install SkyPilot and DVC using
pip:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ pip <span class="token function">install</span> skypilot<span class="token punctuation">[</span>all<span class="token punctuation">]</span> dvc<span class="token punctuation">[</span>all<span class="token punctuation">]</span></code></pre></div>
<p>Next, configure your cloud provider credentials. You can refer to the
<a href="https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloud-account-setup" target="_blank" rel="nofollow noopener noreferrer">SkyPilot documentation</a>
for more details.</p>
<p>Confirm the setup with the following command:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ sky check</code></pre></div>
<p>After configuring the setup, you’ll need to download the data from the read-only
remote storage in this project to your local machine, then upload it to your own
bucket (where you have write access).</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token comment"># Pull data from remote storage to local machine</span>
$ dvc pull
<span class="token comment"># Configure your own bucket in .dvc/config:</span>
<span class="token comment"># - AWS: https://iterative.ai/blog/aws-remotes-in-dvc</span>
<span class="token comment"># - GCP: https://iterative.ai/blog/using-gcp-remotes-in-dvc</span>
<span class="token comment"># - Azure: https://iterative.ai/blog/azure-remotes-in-dvc</span>
<span class="token comment"># Push the data to your own bucket</span>
$ dvc push</code></pre></div>
<h2 id="huggingface-perform-resource-efficient-fine-tuning" style="position:relative;">HuggingFace: Perform Resource Efficient Fine-Tuning<a href="#huggingface-perform-resource-efficient-fine-tuning" aria-label="huggingface perform resource efficient fine tuning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Here we’ll walk through the training approach without going into too much
detail. Please check the references at the end of this post for more information
on the techniques used. We started by loading a pretrained Llama-2 model and
tokenizer. To make training even more efficient, we used <code>bitsandbytes</code> and
techniques like <a href="https://huggingface.co/blog/peft" target="_blank" rel="nofollow noopener noreferrer">PEFT</a> and
<a href="https://github.com/artidoro/qlora" target="_blank" rel="nofollow noopener noreferrer">QLoRA</a> to quantize the model to 4-bit
precision.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> torch
<span class="token keyword">from</span> transformers <span class="token keyword">import</span> AutoModelForCausalLM<span class="token punctuation">,</span> AutoTokenizer<span class="token punctuation">,</span> BitsAndBytesConfig


<span class="token keyword">def</span> <span class="token function">get_model_and_tokenizer</span><span class="token punctuation">(</span>pretrained_model_path<span class="token punctuation">,</span> use_4bit<span class="token punctuation">,</span> bnb_4bit_compute_dtype<span class="token punctuation">,</span> bnb_4bit_quant_type<span class="token punctuation">,</span> use_nested_quant<span class="token punctuation">,</span> device_map<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Map the dtype name from the params (e.g. "float16") to the torch dtype object</span>
    compute_dtype <span class="token operator">=</span> <span class="token builtin">getattr</span><span class="token punctuation">(</span>torch<span class="token punctuation">,</span> bnb_4bit_compute_dtype<span class="token punctuation">)</span>
    <span class="token comment"># 4-bit quantization settings for bitsandbytes</span>
    bnb_config <span class="token operator">=</span> BitsAndBytesConfig<span class="token punctuation">(</span>
        load_in_4bit<span class="token operator">=</span>use_4bit<span class="token punctuation">,</span>
        bnb_4bit_quant_type<span class="token operator">=</span>bnb_4bit_quant_type<span class="token punctuation">,</span>
        bnb_4bit_compute_dtype<span class="token operator">=</span>compute_dtype<span class="token punctuation">,</span>
        bnb_4bit_use_double_quant<span class="token operator">=</span>use_nested_quant<span class="token punctuation">,</span>
    <span class="token punctuation">)</span>
    model <span class="token operator">=</span> AutoModelForCausalLM<span class="token punctuation">.</span>from_pretrained<span class="token punctuation">(</span>
        pretrained_model_name_or_path<span class="token operator">=</span>pretrained_model_path<span class="token punctuation">,</span>
        quantization_config<span class="token operator">=</span>bnb_config<span class="token punctuation">,</span>
        device_map<span class="token operator">=</span>device_map
    <span class="token punctuation">)</span>
    <span class="token comment"># Disable the generation cache while training</span>
    model<span class="token punctuation">.</span>config<span class="token punctuation">.</span>use_cache <span class="token operator">=</span> <span class="token boolean">False</span>
    model<span class="token punctuation">.</span>config<span class="token punctuation">.</span>pretraining_tp <span class="token operator">=</span> <span class="token number">1</span>
    tokenizer <span class="token operator">=</span> AutoTokenizer<span class="token punctuation">.</span>from_pretrained<span class="token punctuation">(</span>pretrained_model_name_or_path<span class="token operator">=</span>pretrained_model_path<span class="token punctuation">,</span>
                                              padding_side<span class="token operator">=</span><span class="token string">"right"</span><span class="token punctuation">,</span>
                                              trust_remote_code<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span>
    <span class="token comment"># Reuse the EOS token for padding</span>
    tokenizer<span class="token punctuation">.</span>pad_token <span class="token operator">=</span> tokenizer<span class="token punctuation">.</span>eos_token
    <span class="token keyword">return</span> model<span class="token punctuation">,</span> tokenizer</code></pre></div>
<p>Then we leveraged the <a href="https://huggingface.co/docs/trl/index" target="_blank" rel="nofollow noopener noreferrer">TRL</a> library’s
Supervised Fine-tuning Trainer (SFTTrainer) to efficiently adapt the model to
our target domain. The SFTTrainer provides a simple API for supervised
fine-tuning of text-generation models:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> os

<span class="token keyword">from</span> trl <span class="token keyword">import</span> SFTTrainer

<span class="token comment"># CheckpointCallback and cleanup_incomplete_checkpoints are helpers defined in the</span>
<span class="token comment"># project repository; DVCLiveCallback comes from DVCLive's Hugging Face integration.</span>


<span class="token keyword">def</span> <span class="token function">train_model</span><span class="token punctuation">(</span>model<span class="token punctuation">,</span> train_dataset<span class="token punctuation">,</span> valid_dataset<span class="token punctuation">,</span> lora_config<span class="token punctuation">,</span> tokenizer<span class="token punctuation">,</span> training_args<span class="token punctuation">,</span> model_adapter_out_path<span class="token punctuation">)</span><span class="token punctuation">:</span>
    trainer <span class="token operator">=</span> SFTTrainer<span class="token punctuation">(</span>
        model<span class="token operator">=</span>model<span class="token punctuation">,</span>
        train_dataset<span class="token operator">=</span>train_dataset<span class="token punctuation">,</span>
        eval_dataset<span class="token operator">=</span>valid_dataset<span class="token punctuation">,</span>
        peft_config<span class="token operator">=</span>lora_config<span class="token punctuation">,</span>
        dataset_text_field<span class="token operator">=</span><span class="token string">"text"</span><span class="token punctuation">,</span>
        tokenizer<span class="token operator">=</span>tokenizer<span class="token punctuation">,</span>
        args<span class="token operator">=</span>training_args<span class="token punctuation">,</span>
    <span class="token punctuation">)</span>
    cleanup_incomplete_checkpoints<span class="token punctuation">(</span>training_args<span class="token punctuation">.</span>output_dir<span class="token punctuation">)</span>
    trainer<span class="token punctuation">.</span>add_callback<span class="token punctuation">(</span>CheckpointCallback<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    trainer<span class="token punctuation">.</span>add_callback<span class="token punctuation">(</span>DVCLiveCallback<span class="token punctuation">(</span>log_model<span class="token operator">=</span><span class="token string">"all"</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token comment"># Start fresh if the output directory is empty, otherwise resume</span>
    <span class="token keyword">if</span> <span class="token keyword">not</span> os<span class="token punctuation">.</span>listdir<span class="token punctuation">(</span>training_args<span class="token punctuation">.</span>output_dir<span class="token punctuation">)</span><span class="token punctuation">:</span>
        trainer<span class="token punctuation">.</span>train<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token keyword">else</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"Resuming from checkpoint..."</span><span class="token punctuation">)</span>
        trainer<span class="token punctuation">.</span>train<span class="token punctuation">(</span>resume_from_checkpoint<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span>
    <span class="token comment"># Save only the fine-tuned adapter weights</span>
    trainer<span class="token punctuation">.</span>model<span class="token punctuation">.</span>save_pretrained<span class="token punctuation">(</span>model_adapter_out_path<span class="token punctuation">)</span></code></pre></div>
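<p>The training function above cleans up incomplete checkpoints before deciding whether to resume, which matters when a spot instance is interrupted in the middle of a save. The project's own <code>cleanup_incomplete_checkpoints</code> is not shown in this post, so the following is only a guess at what a helper with a similar purpose might look like. The function name <code>remove_incomplete_checkpoints</code> and the choice of <code>trainer_state.json</code> as the marker of a finished checkpoint are assumptions.</p>

```python
import shutil
import tempfile
from pathlib import Path

# Assumption: a checkpoint folder counts as complete once this file exists.
MARKER = "trainer_state.json"


def remove_incomplete_checkpoints(output_dir: str) -> list[str]:
    """Delete checkpoint-* folders that lack the marker file; return their names."""
    removed = []
    root = Path(output_dir)
    if not root.is_dir():
        return removed
    for ckpt in sorted(root.glob("checkpoint-*")):
        if ckpt.is_dir() and not (ckpt / MARKER).exists():
            shutil.rmtree(ckpt)
            removed.append(ckpt.name)
    return removed


# Small demonstration with one complete and one partially written checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    complete = Path(tmp) / "checkpoint-100"
    partial = Path(tmp) / "checkpoint-200"
    complete.mkdir()
    partial.mkdir()
    (complete / MARKER).write_text("{}")
    (partial / "optimizer.pt").write_text("")
    removed = remove_incomplete_checkpoints(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    print(removed, remaining)
```

<p>With such a cleanup in place, resuming only ever picks up a checkpoint that was written completely.</p>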
<p>The quantized model can then be efficiently fine-tuned on much less capable
hardware while retaining almost the same level of accuracy. By leveraging the
pretrained model, tokenization, and efficient training techniques, we were able
to effectively customize the model for our use case with far fewer resources than
training from scratch. The pieces fit together nicely to enable state-of-the-art
results on a budget.</p>
<h2 id="dvc-define-ml-pipeline" style="position:relative;">DVC: Define ML Pipeline<a href="#dvc-define-ml-pipeline" aria-label="dvc define ml pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Writing the code to efficiently fine-tune a large language model is only part of
the story. We also need to define a reproducible pipeline that can be run
multiple times with different parameters and hyperparameters. This is where DVC
comes in. Below are the stages of the pipeline defined in
<a href="https://github.com/alex000kim/ML-Pipeline-With-DVC-SkyPilot-HuggingFace/blob/main/dvc.yaml" target="_blank" rel="nofollow noopener noreferrer"><code>dvc.yaml</code></a>:</p>
<ul>
<li><code>generate_identity_data</code>: Generates a small set of hardcoded conversational
data about the model’s identity, creators, etc., and saves it to
<code>identity_subset.jsonl</code>.</li>
<li><code>process_orca_data</code>: Takes a subset of the
<a href="https://huggingface.co/datasets/Open-Orca/OpenOrca" target="_blank" rel="nofollow noopener noreferrer">Open Orca</a> dataset and
converts it to the prompt/completion format, saving to
<code>orca_processed_subset.jsonl</code>.</li>
<li><code>process_platypus_data</code>: Similarly processes a subset of the
<a href="https://huggingface.co/datasets/garage-bAInd/Open-Platypus" target="_blank" rel="nofollow noopener noreferrer">Open Platypus</a>
dataset.</li>
<li><code>data_split</code>: Splits each of the 3 processed dataset files into
train/validation sets.</li>
<li><code>merge_data</code>: Concatenates all the train splits and all the validation splits
into final <code>train.jsonl</code> and <code>val.jsonl</code>.</li>
<li><code>train</code>: Fine-tunes a Llama-2 model on the training data using the
<a href="https://github.com/huggingface/peft" target="_blank" rel="nofollow noopener noreferrer">PEFT</a> library and
<a href="https://huggingface.co/docs/trl/main/en/sft_trainer" target="_blank" rel="nofollow noopener noreferrer">Supervised Fine-tuning Trainer</a>.
Saves fine-tuned model adapters.</li>
<li><code>merge_model</code>: Merges the fine-tuned adapter back into the original Llama-2
model.</li>
<li><code>sanity_check</code>: Runs a few prompts through the original and fine-tuned model
for a quick sanity check.</li>
</ul>
<p><img src="https://dvc.org/2023-09-08/dvc_dag-8bdb088e1c1034e15f82cca73f3f4360.svg" alt="DVC pipeline DAG"></p>
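<p>The stage list above can also be read as a dependency graph. The sketch below encodes it as a plain Python dictionary and derives a valid execution order with the standard library. The stage names come from the list, but the dependency edges are inferred from the stage descriptions rather than copied from <code>dvc.yaml</code>, so treat them as an assumption.</p>

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on (inferred, not from dvc.yaml).
pipeline = {
    "generate_identity_data": set(),
    "process_orca_data": set(),
    "process_platypus_data": set(),
    "data_split": {
        "generate_identity_data",
        "process_orca_data",
        "process_platypus_data",
    },
    "merge_data": {"data_split"},
    "train": {"merge_data"},
    "merge_model": {"train"},
    "sanity_check": {"merge_model"},
}

# static_order() yields every stage after all of its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

<p>The three data preparation stages have no dependencies on each other, so they may appear in any order, while everything from <code>data_split</code> onward forms a linear chain.</p>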
<p>The
<a href="https://github.com/alex000kim/ML-Pipeline-With-DVC-SkyPilot-HuggingFace/blob/main/params.yaml" target="_blank" rel="nofollow noopener noreferrer"><code>params.yaml</code></a>
file contains the project’s configuration values and training hyperparameters.</p>
<p>You can try a larger model by changing the
<a href="https://github.com/alex000kim/ML-Pipeline-With-DVC-SkyPilot-HuggingFace/blob/main/params.yaml#L15" target="_blank" rel="nofollow noopener noreferrer"><code>train.model_size</code></a>
parameter to <code>13b</code> (you might need to either request a larger instance or reduce
the batch size to fit in GPU memory).</p>
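<p>Parameters such as <code>train.model_size</code> are addressed with a dotted path into the nested structure of <code>params.yaml</code>. The sketch below shows, conceptually, what overriding such a value amounts to. It is not how DVC implements parameter overrides; <code>set_param</code> is a hypothetical helper and the dictionary holds placeholder values instead of the project's real configuration.</p>

```python
import copy


def set_param(params: dict, dotted_key: str, value) -> dict:
    """Return a copy of params with the value at dotted_key replaced."""
    updated = copy.deepcopy(params)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Walk down the nested dictionaries, creating levels if missing.
        node = node.setdefault(key, {})
    node[leaf] = value
    return updated


# Placeholder values standing in for the contents of params.yaml.
params = {"train": {"model_size": "7b", "batch_size": 4}}
new_params = set_param(params, "train.model_size", "13b")
print(new_params)
```

<p>Returning a modified copy keeps the original configuration intact, which makes it easy to compare the two settings side by side.</p>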
<h2 id="skypilot-run-everything-in-cloud" style="position:relative;">SkyPilot: Run everything in Cloud<a href="#skypilot-run-everything-in-cloud" aria-label="skypilot run everything in cloud permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>You can either develop the project and run experiments interactively in the
cloud inside VS Code, or submit a run job to the cloud and pull the results to
your local machine.</p>
<h3 id="developing-and-running-experiments-interactively-in-the-cloud" style="position:relative;">Developing and Running Experiments Interactively in the Cloud<a href="#developing-and-running-experiments-interactively-in-the-cloud" aria-label="developing and running experiments interactively in the cloud permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>To launch a cloud instance for interactive development, run:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ sky launch <span class="token parameter variable">-c</span> vscode <span class="token parameter variable">-i</span> <span class="token number">60</span> sky-vscode.yaml</code></pre></div>
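<p>This SkyPilot command provisions the instance described in <code>sky-vscode.yaml</code>
as a cluster named <code>vscode</code> (<code>-c</code>), configures it to stop automatically after
60 idle minutes (<code>-i 60</code>), and starts a
<a href="https://code.visualstudio.com/docs/remote/tunnels" target="_blank" rel="nofollow noopener noreferrer">VS Code tunnel</a> to the cloud
instance.</p>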
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># sky-vscode.yaml</span>
<span class="token key atrule">name</span><span class="token punctuation">:</span> sky<span class="token punctuation">-</span>vscode
<span class="token key atrule">resources</span><span class="token punctuation">:</span>
  <span class="token key atrule">accelerators</span><span class="token punctuation">:</span> A10G<span class="token punctuation">:</span><span class="token number">1</span>
  <span class="token key atrule">cloud</span><span class="token punctuation">:</span> aws
  <span class="token key atrule">use_spot</span><span class="token punctuation">:</span> <span class="token boolean important">true</span>
<span class="token key atrule">workdir</span><span class="token punctuation">:</span> .
<span class="token key atrule">file_mounts</span><span class="token punctuation">:</span>
  <span class="token key atrule">~/.ssh/id_rsa</span><span class="token punctuation">:</span> ~/.ssh/id_rsa
  <span class="token key atrule">~/.ssh/id_rsa.pub</span><span class="token punctuation">:</span> ~/.ssh/id_rsa.pub
  <span class="token key atrule">~/.gitconfig</span><span class="token punctuation">:</span> ~/.gitconfig
<span class="token key atrule">setup</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
  ...
  pip install -r requirements.txt
  sudo snap install --classic code
  ...</span>
<span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
  code tunnel --accept-server-license-terms</span></code></pre></div>
<p>Once the tunnel is created, you can open the VS Code instance in your browser by
clicking the link in the terminal output.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/a992ce025f1618d1bd2e516774f9dd4b/39600/vscode_tunnel.png" alt="VS Code Tunnel" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h3 id="submitting-experiment-jobs-to-the-cloud" style="position:relative;">Submitting Experiment Jobs to the Cloud<a href="#submitting-experiment-jobs-to-the-cloud" aria-label="submitting experiment jobs to the cloud permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>When you are ready to launch a long-running training job, run:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ sky launch <span class="token parameter variable">-c</span> train --use-spot <span class="token parameter variable">-i</span> <span class="token number">30</span> <span class="token parameter variable">--down</span> sky-training.yaml</code></pre></div>
<p>This SkyPilot command uses spot instances to save costs and automatically
terminates the instance after 30 minutes of idleness. Once the experiment is
complete, its artifacts such as model weights and metrics are stored in your
bucket (thanks to the <a href="https://dvc.org/doc/command-reference/exp/push"><code>dvc exp push origin</code></a> command in <code>sky-training.yaml</code>).</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># sky-training.yaml</span>
<span class="token key atrule">name</span><span class="token punctuation">:</span> sky<span class="token punctuation">-</span>training
<span class="token key atrule">resources</span><span class="token punctuation">:</span>
  <span class="token key atrule">accelerators</span><span class="token punctuation">:</span> A10G<span class="token punctuation">:</span><span class="token number">1</span>
  <span class="token key atrule">cpus</span><span class="token punctuation">:</span> <span class="token number">8</span>
  <span class="token key atrule">cloud</span><span class="token punctuation">:</span> aws
  <span class="token key atrule">disk_size</span><span class="token punctuation">:</span> <span class="token number">1024</span>
<span class="token key atrule">workdir</span><span class="token punctuation">:</span> .
<span class="token key atrule">file_mounts</span><span class="token punctuation">:</span>
  <span class="token key atrule">~/.ssh/id_rsa</span><span class="token punctuation">:</span> ~/.ssh/id_rsa
  <span class="token key atrule">~/.ssh/id_rsa.pub</span><span class="token punctuation">:</span> ~/.ssh/id_rsa.pub
  <span class="token key atrule">~/.gitconfig</span><span class="token punctuation">:</span> ~/.gitconfig
<span class="token key atrule">setup</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
  pip install --upgrade pip
  pip install -r requirements.txt</span>
<span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
  dvc exp run --pull
  dvc exp push origin</span></code></pre></div>
<p>While the model is training, you can monitor the logs by running:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ sky logs train
<span class="token punctuation">..</span>.
<span class="token punctuation">(</span>sky-training, <span class="token assign-left variable">pid</span><span class="token operator">=</span><span class="token number">25305</span><span class="token punctuation">)</span> <span class="token number">52</span>%<span class="token operator">|</span>█████▏ <span class="token operator">|</span> <span class="token number">28</span>/54 <span class="token punctuation">[</span>00:2<span class="token operator"><span class="token file-descriptor important">0</span><</span>01:01, <span class="token number">2</span>.38s/it<span class="token punctuation">]</span>
<span class="token punctuation">(</span>sky-training, <span class="token assign-left variable">pid</span><span class="token operator">=</span><span class="token number">25305</span><span class="token punctuation">)</span> <span class="token number">54</span>%<span class="token operator">|</span>█████▎ <span class="token operator">|</span> <span class="token number">29</span>/54 <span class="token punctuation">[</span>00:2<span class="token operator"><span class="token file-descriptor important">2</span><</span>00:56, <span class="token number">2</span>.28s/it<span class="token punctuation">]</span>
<span class="token punctuation">(</span>sky-training, <span class="token assign-left variable">pid</span><span class="token operator">=</span><span class="token number">25305</span><span class="token punctuation">)</span> <span class="token number">56</span>%<span class="token operator">|</span>█████▌ <span class="token operator">|</span> <span class="token number">30</span>/54 <span class="token punctuation">[</span>00:2<span class="token operator"><span class="token file-descriptor important">5</span><</span>00:57, <span class="token number">2</span>.39s/it<span class="token punctuation">]</span>
<span class="token punctuation">(</span>sky-training, <span class="token assign-left variable">pid</span><span class="token operator">=</span><span class="token number">25305</span><span class="token punctuation">)</span> <span class="token number">57</span>%<span class="token operator">|</span>█████▋ <span class="token operator">|</span> <span class="token number">31</span>/54 <span class="token punctuation">[</span>00:2<span class="token operator"><span class="token file-descriptor important">8</span><</span>01:01, <span class="token number">2</span>.67s/it<span class="token punctuation">]</span>
<span class="token punctuation">..</span>.</code></pre></div>
<p>Then, you can pull the results of the experiment to your local machine by
running:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ dvc exp pull origin</code></pre></div>
<h3 id="customizing-the-cloud-instance-and-parameters" style="position:relative;">Customizing the Cloud Instance and Parameters<a href="#customizing-the-cloud-instance-and-parameters" aria-label="customizing the cloud instance and parameters permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li>
<p>You can change the cloud provider and instance type in the <code>resources</code> section
of <code>sky-training.yaml</code> or <code>sky-vscode.yaml</code>.</p>
</li>
<li>
<p>To enable
<a href="https://dvc.org/doc/studio/user-guide/projects-and-experiments/live-metrics-and-plots" target="_blank" rel="nofollow noopener noreferrer">DVC Studio integration</a>,
for real-time monitoring of metrics and plots, add the
<code>--env DVC_STUDIO_TOKEN</code> option to the <code>sky launch</code> commands above.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/0026972baf62460460e11e3e6d2b56d1/39600/dvc_studio.png" alt="DVC Studio integration" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
</li>
<li>
<p>To enable <a href="https://wandb.ai/" target="_blank" rel="nofollow noopener noreferrer">Weights & Biases</a> integration, add the
<code>--env WANDB_API_KEY</code> option to the <code>sky launch</code> commands above.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/a2fec1735a25b18227687762225dab5d/39600/wandb.png" alt="Weights & Biases integration" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
</li>
</ul>
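<p>The options above can also be assembled programmatically. The following minimal Python sketch builds the <code>sky launch</code> command for the training job and adds <code>--env</code> only for variables that are set locally. The helper name <code>build_launch_command</code> is hypothetical, and the sketch assumes that <code>--env NAME</code> without a value forwards the local value of <code>NAME</code>. It prints the command rather than running it.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import os
import shlex

# Variables named in the list above; each is forwarded only if it is set locally.
OPTIONAL_ENV_VARS = ("DVC_STUDIO_TOKEN", "WANDB_API_KEY")


def build_launch_command(environ=None):
    """Return the argv list for the training `sky launch` command (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    argv = ["sky", "launch", "-c", "train", "--use-spot", "-i", "30", "--down"]
    for name in OPTIONAL_ENV_VARS:
        if environ.get(name):
            # Assumption: `--env NAME` with no value passes the local value of NAME.
            argv += ["--env", name]
    argv.append("sky-training.yaml")
    return argv


if __name__ == "__main__":
    # Print the command instead of executing it, so it can be reviewed first.
    print(shlex.join(build_launch_command()))</code></pre></div>
<p>Keeping the optional integrations in one tuple means a missing token simply drops the corresponding flag instead of launching a job with an empty value.</p>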
<h2 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In this post, we walked through an end-to-end production ML pipeline for
fine-tuning large language models using several key technologies:</p>
<ul>
<li>DVC for reproducible pipelines and efficient dataset versioning</li>
<li>SkyPilot for launching cloud compute resources on demand</li>
<li>Hugging Face Transformers and other libraries for efficient transformer model
training</li>
<li>Parameter-efficient fine-tuning (PEFT) combined with 4-bit quantization
(QLoRA) for reduced precision and memory usage</li>
</ul>
<p>We used the everything-as-code (EaC) approach of centralizing code, datasets,
hyperparameters, model weights, training infrastructure and development
environment in a git repository. Even the most subtle changes to the training
setup will be recorded in the git history.</p>
<p>We started with a pretrained Llama-2 model and used <code>bitsandbytes</code> to quantize
it to 4-bit precision. Then, we leveraged the TRL library’s Supervised
Fine-tuning Trainer with PEFT for efficient domain-specific fine-tuning.</p>
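<p>To get a feel for why 4-bit quantization matters, here is a back-of-the-envelope estimate of the memory needed for the model weights alone. It assumes a 7-billion-parameter Llama-2 variant (the model size is an assumption for illustration) and ignores quantization constants, activations, adapter weights and optimizer state.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Assumption: a 7-billion-parameter model; substitute the size you actually use.
N_PARAMS = 7_000_000_000

if __name__ == "__main__":
    for label, bits in (("16-bit", 16), ("4-bit", 4)):
        print(f"{label}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")</code></pre></div>
<p>Under these assumptions, the weights shrink from roughly 13 GiB at 16-bit precision to about 3.3 GiB at 4-bit precision.</p>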
<p>The resulting pipeline enables state-of-the-art LLM capabilities to be
customized for a target use case with modest compute requirements. DVC and
SkyPilot enabled this to be built as a reproducible ML workflow using cloud
resources efficiently.</p>
<p>This demonstrates how proper MLOps tooling and techniques can make large
language model fine-tuning achievable even with limited resources. The modular
design also makes it easy to swap components like the model architecture,
training method, or cloud provider.</p>
<h3 id="references" style="position:relative;">References<a href="#references" aria-label="references permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li><a href="https://huggingface.co/blog/peft" target="_blank" rel="nofollow noopener noreferrer">PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware</a></li>
<li><a href="https://huggingface.co/blog/4bit-transformers-bitsandbytes" target="_blank" rel="nofollow noopener noreferrer">Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA</a></li>
<li><a href="https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehensive-case-study-for-tailoring-models-to-unique-applications" target="_blank" rel="nofollow noopener noreferrer">Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Models to Unique Applications</a></li>
<li><a href="https://mlabonne.github.io/blog/posts/Fine_Tune_Your_Own_Llama_2_Model_in_a_Colab_Notebook.html" target="_blank" rel="nofollow noopener noreferrer">Fine-Tune Your Own Llama 2 Model in a Colab Notebook</a></li>
<li><a href="https://blog.skypilot.co/finetuning-llama2-operational-guide/" target="_blank" rel="nofollow noopener noreferrer">Finetuning Llama 2 in your own cloud environment, privately</a></li>
</ul>
<p>https://dvc.org/blog/sagemaker-model-deployment (Wed, 30 Aug 2023 00:00:00 GMT)</p>
<p>Amazon SageMaker from AWS is a popular platform for deploying Machine Learning
models, showing up in almost all search results for the “best ML deployment
platforms today.” So no doubt we’ve had many users ask us how they can deploy
their models to SageMaker. If you would also like some help with this, you are
in the right place.</p>
<p>With DVC pipelines and live metrics tracking using DVCLive and DVC Studio,
iterating on your Machine Learning experiments is a simple process. And DVC
Model Registry makes logging, tracking and deploying your trained models equally
simple. In this article, we’ll walk you through how you can create a training
pipeline that saves your trained models to AWS S3, and how you can then deploy
the models to different environments in SageMaker automatically!</p>
<p>Interested in the final output right now? <a href="https://github.com/iterative/example-get-started-experiments/" target="_blank" rel="nofollow noopener noreferrer">Here’s the code</a>.</p>
<h1 id="prerequisites" style="position:relative;">Prerequisites<a href="#prerequisites" aria-label="prerequisites permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>To follow along, you’ll need to provision the following resources in AWS:</p>
<ul>
<li>An S3 bucket for saving your models</li>
<li>Credentials with write access to the above S3 bucket. You’ll need this during
training to save the models.</li>
<li>AWS role with <code>AmazonS3FullAccess</code> and <code>AmazonSageMakerFullAccess</code> for reading
the model files and deploying them to SageMaker.</li>
</ul>
<h1 id="first-why-dvc--sagemaker" style="position:relative;">First, why DVC + SageMaker?<a href="#first-why-dvc--sagemaker" aria-label="first why dvc sagemaker permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>DVC provides a unified way to manage your experiments, datasets, models and
code. It works on top of Git, enabling you to apply the best software
engineering and DevOps practices to your Machine Learning projects. It is also
platform agnostic, which means you have full control over the choice of cloud
services. And with a range of options for model deployment, including real-time
and serverless endpoints, SageMaker is a great choice for hosting models of
different sizes and inference frequencies.</p>
<h1 id="prequel-dvc-push-to-save-the-models-during-training" style="position:relative;">Prequel: <code>dvc push</code> to save the models during training<a href="#prequel-dvc-push-to-save-the-models-during-training" aria-label="prequel dvc push to save the models during training permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>DVC simplifies setting up <a href="https://dvc.org/doc/user-guide/pipelines/defining-pipelines" target="_blank" rel="nofollow noopener noreferrer">reproducible pipelines</a> that automatically save your
model files during model training. Each stage in a DVC pipeline represents a
distinct step in the training process. For each stage, you can specify
hyperparameters and other dependencies, such as datasets or outputs of previous
stages. You can also specify the outputs of each stage, such as metrics, plots,
models, and other files. Learn more <a href="https://dvc.org/doc/command-reference/stage/add" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<h2 id="create-a-model-file" style="position:relative;">Create a model file<a href="#create-a-model-file" aria-label="create a model file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The <a href="https://github.com/iterative/example-get-started-experiments/blob/main/dvc.yaml#L33" target="_blank" rel="nofollow noopener noreferrer"><code>sagemaker</code> stage</a> of our pipeline creates a tar file (<code>model.tar.gz</code>) of
our trained model. We then mark this tar file as an output of the stage:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ dvc stage <span class="token function">add</span> <span class="token parameter variable">-n</span> sagemaker … <span class="token parameter variable">-o</span> model.tar.gz …</code></pre></div>
<p>Note that it is not essential to create a separate <code>sagemaker</code> stage like we
did. You could also create the tar file as part of <code>train</code> or any other relevant
stage. In fact, you could even use this approach without a DVC pipeline, by
simply <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>ing the model files or logging them with the <a href="https://dvc.org/doc/dvclive/live/log_artifact" target="_blank" rel="nofollow noopener noreferrer">DVCLive
<code>log_artifact()</code> method</a>. But we recommend using a DVC pipeline for easy
reproducibility of your ML experiments.</p>
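<p>As an illustration, packaging a trained model into <code>model.tar.gz</code> could look like the sketch below. It is not the code from the example repository: the helper <code>package_model</code> and the placeholder file are hypothetical, and the sketch assumes the model is a single file that should sit at the root of the archive.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import tarfile
import tempfile
from pathlib import Path


def package_model(model_path, archive_path="model.tar.gz"):
    """Write model_path into a gzip-compressed tar archive and return the archive path."""
    model_path = Path(model_path)
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as archive:
        # Store the file at the archive root, without its parent directories.
        archive.add(model_path, arcname=model_path.name)
    return archive_path


if __name__ == "__main__":
    # Demonstrate with a placeholder file; a real stage would pass the trained model.
    with tempfile.TemporaryDirectory() as tmp:
        model = Path(tmp) / "model.pkl"
        model.write_bytes(b"placeholder weights")
        archive = package_model(model, Path(tmp) / "model.tar.gz")
        with tarfile.open(archive) as tar:
            print(tar.getnames())</code></pre></div>
<p>A script like this can serve as the command of whichever stage declares <code>model.tar.gz</code> as its output.</p>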
<h2 id="configure-dvc-remote" style="position:relative;">Configure DVC remote<a href="#configure-dvc-remote" aria-label="configure dvc remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Additionally, we’ve configured the default <a href="https://dvc.org/doc/user-guide/data-management/remote-storage" target="_blank" rel="nofollow noopener noreferrer">DVC remote</a> to be our S3 bucket:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ dvc remote <span class="token function">add</span> <span class="token parameter variable">-d</span> storage s3://dvc-public/remote/get-started-pools</code></pre></div>
<p>This means that whenever we run <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>, the updated model tar file is
pushed to the S3 bucket.</p>
<h2 id="run-the-pipeline-to-save-the-model-in-s3" style="position:relative;">Run the pipeline to save the model in S3<a href="#run-the-pipeline-to-save-the-model-in-s3" aria-label="run the pipeline to save the model in s3 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Now, every time we run our training pipeline, an updated model tar file is
generated, and we <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> it to the remote S3 bucket. By storing large files
like the model tar file in remote storage such as S3, DVC makes it possible to
track them in Git, maintaining Git as the single source of truth for your
projects.</p>
<h1 id="track-and-manage-model-versions-in-dvc-model-registry" style="position:relative;">Track and manage model versions in DVC model registry<a href="#track-and-manage-model-versions-in-dvc-model-registry" aria-label="track and manage model versions in dvc model registry permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Our training script <a href="https://github.com/iterative/example-get-started-experiments/blob/main/src/train.py#L72" target="_blank" rel="nofollow noopener noreferrer">logs our model</a> using the <a href="https://dvc.org/doc/dvclive/live/log_artifact" target="_blank" rel="nofollow noopener noreferrer">DVCLive <code>log_artifact()</code>
method</a>, which creates an <a href="https://github.com/iterative/example-get-started-experiments/blob/main/results/train/dvc.yaml#L8" target="_blank" rel="nofollow noopener noreferrer">artifact entry</a> of type <code>model</code> in a <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">artifacts</span><span class="token punctuation">:</span>
  <span class="token key atrule">pool-segmentation</span><span class="token punctuation">:</span>
    <span class="token key atrule">path</span><span class="token punctuation">:</span> ../../models/model.pkl
    <span class="token key atrule">type</span><span class="token punctuation">:</span> model
<span class="token punctuation">...</span></code></pre></div>
<p>Because of this, when we <a href="https://dvc.org/doc/studio/user-guide/projects-and-experiments/create-a-project#connect-to-a-git-repository-and-add-a-project" target="_blank" rel="nofollow noopener noreferrer">add the project to DVC Studio</a>, the model appears in
the <a href="https://studio.datachain.ai/user/~/models" target="_blank" rel="nofollow noopener noreferrer">model registry</a>.</p>
<p>Note that there are other ways to register the model in the model registry: you
can <a href="https://dvc.org/doc/studio/user-guide/model-registry/add-a-model" target="_blank" rel="nofollow noopener noreferrer">add the model from the Studio UI</a> or manually add it to the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>
file.</p>
<p>Once the model is registered in the model registry, you can assign version
numbers every time your ML experiment produces a model version that you like.
Use the <a href="https://dvc.org/doc/studio/user-guide/model-registry/register-version" target="_blank" rel="nofollow noopener noreferrer"><code>Register version</code></a> option to select the Git commit for the experiment
which produced the desired model version, and assign it a <a href="https://semver.org/" target="_blank" rel="nofollow noopener noreferrer">semantic version</a>.
Every version registration is saved using specially formatted Git tags, which
you can find in the <a href="https://github.com/iterative/example-get-started-experiments/tags" target="_blank" rel="nofollow noopener noreferrer">Git repository</a>.</p>
<p><img src="https://dvc.org/2023-08-30/mr-register-version-94e709a5988cb2ef17681de3288d5803.gif" alt="Version registration in the DVC Model Registry"><em>Version
registration in the DVC Model Registry</em></p>
<h1 id="trigger-model-deployment-with-stage-assignments" style="position:relative;">Trigger model deployment with stage assignments<a href="#trigger-model-deployment-with-stage-assignments" aria-label="trigger model deployment with stage assignments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>So far, you have saved your model versions in your Git repository (as Git tags)
and the actual model tar files in S3. Suppose you just registered version
<code>1.0.0</code> of your model, and would like to deploy it to your <code>dev</code> environment so
that you and your team can evaluate its performance. The model registry
simplifies this too, by providing a mechanism to assign stages to model versions
and creating specially formatted Git tags representing this action.</p>
<p><img src="https://dvc.org/2023-08-30/mr-assign-stage-0bf0129f7d342729c08aa42acd95e1b2.gif" alt="Stage assignment in the DVC Model Registry"><em>Stage
assignment in the DVC Model Registry</em></p>
<p>Since stage assignment also creates Git tags, you can write a <a href="https://github.com/iterative/example-get-started-experiments/blob/main/.github/workflows/deploy-model.yml" target="_blank" rel="nofollow noopener noreferrer">CI/CD action that
runs on Git tag push</a>.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">on</span><span class="token punctuation">:</span>
  <span class="token key atrule">push</span><span class="token punctuation">:</span>
    <span class="token key atrule">tags</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token string">'results/train=pool-segmentation#*'</span></code></pre></div>
<p>This action parses the Git tags to determine the model, version and stage. The DVC
model registry internally uses <a href="https://mlem.ai/doc/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a> to save version registrations and stage
assignments, and the <a href="https://github.com/iterative/gto-action" target="_blank" rel="nofollow noopener noreferrer">Iterative GTO action</a> can be used in your <a href="https://github.com/iterative/example-get-started-experiments/blob/main/.github/workflows/deploy-model.yml" target="_blank" rel="nofollow noopener noreferrer">GitHub Actions
workflow</a> to parse the Git tags:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/gto<span class="token punctuation">-</span>action@v2</code></pre></div>
<p>This action produces the outputs shown below:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">outputs</span><span class="token punctuation">:</span>
  <span class="token key atrule">event</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> steps.gto.outputs.event <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token comment"># whether the event is a version registration or a stage assignment</span>
  <span class="token key atrule">name</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> steps.gto.outputs.name <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token comment"># model name</span>
  <span class="token key atrule">stage</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> steps.gto.outputs.stage <span class="token punctuation">}</span><span class="token punctuation">}</span>
  <span class="token key atrule">version</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> steps.gto.outputs.version <span class="token punctuation">}</span><span class="token punctuation">}</span></code></pre></div>
<p>This action is available only on GitHub, though; if you’re using GitLab,
Bitbucket or some other provider, you can use the <a href="https://dvc.org/doc/gto/command-reference/check-ref"><code>gto check-ref</code></a>
command to parse the Git tags, which follow <a href="https://mlem.ai/doc/gto/user-guide#git-tags-format" target="_blank" rel="nofollow noopener noreferrer">this format</a>.</p>
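<p>If you parse the tags yourself, a simplified sketch might look like this. It assumes the tag layout <code>name@vX.Y.Z</code> for version registrations and <code>name#stage#N</code> for stage assignments, does not handle deregistration or unassignment tags, and cannot resolve which version a stage tag refers to (that requires inspecting the repository, which <code>gto check-ref</code> does). The function name <code>parse_gto_tag</code> and the returned keys are hypothetical.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">def parse_gto_tag(tag):
    """Split a GTO-style Git tag into name, event, stage and version (simplified)."""
    if "!" in tag:
        # Deregistration and unassignment tags are out of scope for this sketch.
        raise ValueError(f"unsupported tag: {tag!r}")
    if "#" in tag:
        # Assumed layout: name#stage#N, where N is a numeric counter.
        name, stage, counter = tag.split("#")
        int(counter)  # raises ValueError if the counter is not a number
        return {"name": name, "event": "assignment", "stage": stage, "version": None}
    # Assumed layout: name@vX.Y.Z
    name, sep, version = tag.rpartition("@")
    if not sep or not version.startswith("v"):
        raise ValueError(f"not a recognized tag: {tag!r}")
    return {"name": name, "event": "registration", "stage": None, "version": version}


if __name__ == "__main__":
    print(parse_gto_tag("results/train=pool-segmentation#dev#1"))
    print(parse_gto_tag("results/train=pool-segmentation@v1.0.0"))</code></pre></div>
<p>For anything beyond a quick check, prefer <code>gto check-ref</code>, since it reads the repository and can report the version a stage assignment points to.</p>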
<p>Now, whenever you <a href="https://dvc.org/doc/studio/user-guide/model-registry/assign-stage" target="_blank" rel="nofollow noopener noreferrer"><code>Assign stage</code></a> to a model version, your CI/CD action
knows which version of which model was assigned to which stage. It can then
use the <a href="https://dvc.org/doc/command-reference/get#-url"><code>dvc get --show-url</code></a> command to determine the S3 path of the tar file
for the model version.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml">MODEL_DATA=$(dvc get <span class="token punctuation">-</span><span class="token punctuation">-</span>show<span class="token punctuation">-</span>url . model.tar.gz)</code></pre></div>
<p>Finally, it can invoke the <a href="https://github.com/iterative/example-get-started-experiments/blob/main/sagemaker/deploy_model.py" target="_blank" rel="nofollow noopener noreferrer">deployment script</a> with appropriate inputs.</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">python sagemaker/deploy_model.py <span class="token punctuation">\</span>
<span class="token parameter variable">--name</span> <span class="token variable">${{ needs.parse.outputs.name }</span><span class="token punctuation">}</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--stage</span> <span class="token variable">${{ needs.parse.outputs.stage }</span><span class="token punctuation">}</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--version</span> <span class="token variable">${{ needs.parse.outputs.version }</span><span class="token punctuation">}</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--model_data</span> <span class="token variable">$MODEL_DATA</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--role</span> <span class="token variable">${{ secrets.AWS_ROLE_TO_ASSUME }</span><span class="token punctuation">}</span></code></pre></div>
<p>This automates the model deployment process, which is very helpful if your model
is expected to evolve constantly.</p>
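<p>As an illustration of the interface used in the command above, the following Python sketch shows how a script might accept those flags. Only the flag names come from the command; the validation step, the help texts, and all example values are assumptions for illustration, not the contents of the actual deployment script.</p>

```python
import argparse
from urllib.parse import urlparse


def parse_args(argv=None):
    """Parse the deployment inputs passed by the CI/CD job (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Sketch of a deployment entry point")
    parser.add_argument("--name", required=True, help="model name from the registry")
    parser.add_argument("--stage", required=True, help="stage assigned to the version")
    parser.add_argument("--version", required=True, help="model version")
    parser.add_argument("--model_data", required=True, help="S3 URL of the model tar file")
    parser.add_argument("--role", required=True, help="IAM role used for the deployment")
    args = parser.parse_args(argv)

    # Assumption: the value printed by `dvc get --show-url` is an s3:// URL.
    url = urlparse(args.model_data)
    if url.scheme != "s3" or not url.netloc:
        parser.error("--model_data must be an s3:// URL")
    return args


# Placeholder values, for illustration only.
args = parse_args([
    "--name", "pool-segmentation",
    "--stage", "prod",
    "--version", "v1.0.0",
    "--model_data", "s3://example-bucket/path/to/model.tar.gz",
    "--role", "arn:aws:iam::123456789012:role/example-role",
])
print(args.name, args.stage, args.version)
```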
<p>Next, we will explain the <a href="https://github.com/iterative/example-get-started-experiments/blob/main/sagemaker/deploy_model.py" target="_blank" rel="nofollow noopener noreferrer">deployment script</a>.</p>
<h1 id="deploy-the-model-to-sagemaker-and-run-inference" style="position:relative;">Deploy the model to SageMaker and run inference<a href="#deploy-the-model-to-sagemaker-and-run-inference" aria-label="deploy the model to sagemaker and run inference permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>So far, you’ve seen how you can</p>
<p>✅ create and run reproducible pipelines that save the model to S3,</p>
<p>✅ track and manage model versions in a web model registry, and</p>
<p>✅ assign stages to trigger model deployment.</p>
<p>The last step above specifies which model version should be deployed to which
environment. Now let’s see how to actually</p>
<p>🔲 deploy the model, and</p>
<p>🔲 run inference on it.</p>
<p>A deployment in SageMaker is called an endpoint. When you deploy your model, you
create or update an endpoint. And for running inference, you invoke the
endpoint.</p>
<p><a href="https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-deployment.html" target="_blank" rel="nofollow noopener noreferrer">There are a few different ways to do the actual
deployment</a>, including the <a href="https://sagemaker.readthedocs.io/en/stable/overview.html" target="_blank" rel="nofollow noopener noreferrer">SageMaker Python SDK</a>
and the <a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html" target="_blank" rel="nofollow noopener noreferrer">boto3 library</a>. We have chosen to use the SageMaker Python SDK, which
has a two-step process for deployment:</p>
<ul>
<li>create the SageMaker model bundle (<a href="https://github.com/iterative/example-get-started-experiments/blob/main/sagemaker/deploy_model.py#L38" target="_blank" rel="nofollow noopener noreferrer">click to see the
code</a>), and</li>
<li>create the endpoint (<a href="https://github.com/iterative/example-get-started-experiments/blob/main/sagemaker/deploy_model.py#L54" target="_blank" rel="nofollow noopener noreferrer">click to see the code</a>).</li>
</ul>
<p>Note that if you do not expect your model to be constantly used for inference,
you can create a serverless inference endpoint by specifying a <a href="https://github.com/iterative/example-get-started-experiments/blob/main/sagemaker/deploy_model.py#L58" target="_blank" rel="nofollow noopener noreferrer">serverless
inference config</a> (learn about the <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html#deploy-model-options" target="_blank" rel="nofollow noopener noreferrer">different inference options</a>).</p>
<p>Once deployed, the endpoint status becomes <code>InService</code> in the AWS console.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/dca163debc81f0270e78249320449300/39600/aws-sagemaker-endpoints.png" alt="InService SageMaker Endpoint" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>InService
SageMaker Endpoint in the AWS console</em></p>
<h2 id="run-inference" style="position:relative;">Run inference<a href="#run-inference" aria-label="run inference permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Now that your SageMaker deployment is ready, you can run inference using the
<a href="https://github.com/iterative/example-get-started-experiments/blob/main/src/endpoint_prediction.py#L35" target="_blank" rel="nofollow noopener noreferrer">SageMaker predictor</a> (for boto3, use <a href="https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html" target="_blank" rel="nofollow noopener noreferrer"><code>invoke_endpoint()</code></a>). <a href="https://github.com/iterative/example-get-started-experiments/blob/main/src/endpoint_prediction.py" target="_blank" rel="nofollow noopener noreferrer">Here is an
inference script</a> that pre-processes your input, runs
inference, applies the resulting mask to the input image to create the output
image, and saves the result.</p>
<p>Run this script with the following command:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ python src/endpoint_prediction.py <span class="token punctuation">\</span>
<span class="token parameter variable">--img</span> <span class="token operator"><</span>jpg-file-path<span class="token operator">></span> <span class="token punctuation">\</span>
<span class="token parameter variable">--endpoint_name</span> <span class="token operator"><</span>endpoint-name<span class="token operator">></span> <span class="token punctuation">\</span>
<span class="token parameter variable">--output_path</span> <span class="token operator"><</span>output-folder<span class="token operator">></span></code></pre></div>
<p>Here's my input image:
<span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 355px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/cde059626f1aaf9ad7f2bfa14c0d9a8c/03346/input-image.jpg" alt="Input image" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>And the output identifying the swimming pools:
<span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 355px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/83fdee51e0a59d67a564fef03e05d512/39600/output-image.png" alt="Output image" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h1 id="now-your-turn" style="position:relative;">Now, your turn!<a href="#now-your-turn" aria-label="now your turn permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Let us know (reach out in <a href="https://discordapp.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>) if you run into any issues when trying to
deploy your own model to SageMaker. We will be more than happy to help you
figure it out!</p>
https://dvc.org/blog/dvc-3.0-ml-experiments-data-versioning Wed, 14 Jun 2023 00:00:00 GMT
<p><a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">DVC 3.0</a> helps you <a href="#experiment-tracking-and-beyond">experiment</a>, from notebook
exploration to model management, and works smarter with your
<a href="#smarter-cloudremote-storage">cloud/remote storage</a> to make data versioning
painless.</p>
<h2 id="experiment-tracking-and-beyond" style="position:relative;">Experiment Tracking and Beyond<a href="#experiment-tracking-and-beyond" aria-label="experiment tracking and beyond permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In <a href="https://iterative.ai/blog/dvc-2-0-release" target="_blank" rel="nofollow noopener noreferrer">DVC 2.0</a>, we first released DVC experiments, providing a way to track
experiments as hidden, <a href="https://iterative.ai/blog/experiment-refs" target="_blank" rel="nofollow noopener noreferrer">lightweight Git commits</a>, so you don't have to
separately manage your experiments and code. Now it's easier to <a href="https://dvc.org/doc/start/experiments" target="_blank" rel="nofollow noopener noreferrer">start tracking
experiments</a> from your Python script or notebook (see examples). You only need a
Git repo and DVC's Python logging library <a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a>. You don't need prior DVC
knowledge or an existing DVC project.</p>
<toggle>
<tab title="PyTorch Lightning">
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>lightning <span class="token keyword">import</span> DVCLiveLogger
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
trainer <span class="token operator">=</span> Trainer<span class="token punctuation">(</span>logger<span class="token operator">=</span>DVCLiveLogger<span class="token punctuation">(</span>save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
trainer<span class="token punctuation">.</span>fit<span class="token punctuation">(</span>model<span class="token punctuation">)</span></code></pre></div>
</tab>
<tab title="Hugging Face">
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>huggingface <span class="token keyword">import</span> DVCLiveCallback
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
trainer<span class="token punctuation">.</span>add_callback<span class="token punctuation">(</span>DVCLiveCallback<span class="token punctuation">(</span>save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
trainer<span class="token punctuation">.</span>train<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div>
</tab>
<tab title="Keras">
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>keras <span class="token keyword">import</span> DVCLiveCallback
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
model<span class="token punctuation">.</span>fit<span class="token punctuation">(</span>
    train_dataset<span class="token punctuation">,</span> validation_data<span class="token operator">=</span>validation_dataset<span class="token punctuation">,</span>
    callbacks<span class="token operator">=</span><span class="token punctuation">[</span>DVCLiveCallback<span class="token punctuation">(</span>save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">)</span></code></pre></div>
</tab>
<tab title="General Python API">
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive <span class="token keyword">import</span> Live
<span class="token keyword">with</span> Live<span class="token punctuation">(</span>save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span> <span class="token keyword">as</span> live<span class="token punctuation">:</span>
    live<span class="token punctuation">.</span>log_param<span class="token punctuation">(</span><span class="token string">"epochs"</span><span class="token punctuation">,</span> NUM_EPOCHS<span class="token punctuation">)</span>
    <span class="token keyword">for</span> epoch <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span>NUM_EPOCHS<span class="token punctuation">)</span><span class="token punctuation">:</span>
        train_model<span class="token punctuation">(</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">)</span>
        metrics <span class="token operator">=</span> evaluate_model<span class="token punctuation">(</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">)</span>
        <span class="token keyword">for</span> metric_name<span class="token punctuation">,</span> value <span class="token keyword">in</span> metrics<span class="token punctuation">.</span>items<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
            live<span class="token punctuation">.</span>log_metric<span class="token punctuation">(</span>metric_name<span class="token punctuation">,</span> value<span class="token punctuation">)</span>
        live<span class="token punctuation">.</span>next_step<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div>
</tab>
</toggle>
<p>With the <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC extension for VS Code</a>, you get an experiment tracking workbench
without any servers or logins. Your experiments are also available in our
collaboration hub <a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">Studio</a> and connected to your Git repo automatically, so you
can share, review and merge like you would with code. You can work locally when
you want and use Studio to share if and when it suits you, just like in Git.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/u-URI5Lvc-g?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h3 id="model-management" style="position:relative;">Model Management<a href="#model-management" aria-label="model management permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>With the <a href="https://dvc.org/doc/studio/user-guide/model-registry" target="_blank" rel="nofollow noopener noreferrer">Studio Model Registry</a>, you can use DVC to manage your entire model
lifecycle inside your Git workflow, from creating the model to deploying it in
any deployment system. Our ethos for model management is consistent with
everything else we do: it's all about integrating with your existing stack and
tools, and empowering you to build your workflows around GitOps principles and
automation.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/wX0KBg8EU5Y?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h3 id="cloud-experiments-alpha-release" style="position:relative;">Cloud Experiments (Alpha Release)<a href="#cloud-experiments-alpha-release" aria-label="cloud experiments alpha release permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>When we released DVC 2.0, we also launched the <a href="https://iterative.ai/blog/dvc-2-0-release#new-method-to-provision-cloud-compute-in-new-cml-release" target="_blank" rel="nofollow noopener noreferrer"><code>cml runner</code></a>
command to run continuous integration (CI) on your own cloud instances so you
could automate large ML jobs. Cloud experiments build on this technology without
CI, meaning less setup (you can configure directly in Studio). With the alpha
release of <a href="https://dvc.org/doc/studio/user-guide/projects-and-experiments/run-experiments#cloud-experiments" target="_blank" rel="nofollow noopener noreferrer">Studio Cloud Experiments</a>, you can run DVC experiments on your own
cloud infrastructure in a few clicks, including with GPU and spot instance
support.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/MF5k-qLUiAg?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h3 id="hyperparameter-optimization" style="position:relative;">Hyperparameter Optimization<a href="#hyperparameter-optimization" aria-label="hyperparameter optimization permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC can also help you do hyperparameter optimization by integrating with other
tools. You can <a href="https://dvc.org/doc/user-guide/experiment-management/running-experiments" target="_blank" rel="nofollow noopener noreferrer">queue</a> an entire grid search of experiments, configure multiple
complex model architectures with <a href="https://dvc.org/doc/user-guide/experiment-management/hydra-composition" target="_blank" rel="nofollow noopener noreferrer">Hydra</a> integration, and track your <a href="https://dvc.org/doc/dvclive/ml-frameworks/optuna" target="_blank" rel="nofollow noopener noreferrer">Optuna</a>
studies.</p>
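<p>As a rough sketch of what queuing a grid search can look like, the Python snippet below builds one <code>dvc exp run --queue</code> command per parameter combination. The parameter names <code>train.lr</code> and <code>train.batch_size</code> and the helper <code>queue_commands</code> are hypothetical; the snippet only prints the commands and does not invoke DVC.</p>

```python
import itertools
import shlex


def queue_commands(grid):
    """Build one `dvc exp run --queue` command per combination in the grid."""
    names = sorted(grid)  # fixed order so the output is deterministic
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        cmd = ["dvc", "exp", "run", "--queue"]
        for name, value in zip(names, values):
            # --set-param overrides a value from the params file for this experiment
            cmd += ["--set-param", f"{name}={value}"]
        commands.append(cmd)
    return commands


# Hypothetical parameters; replace with keys from your own params file.
grid = {"train.lr": [0.01, 0.1], "train.batch_size": [16, 32]}
for cmd in queue_commands(grid):
    print(shlex.join(cmd))
```

<p>Each printed command adds one experiment to the queue when executed in a DVC project; the queued experiments can then be started with <code>dvc queue start</code>.</p>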
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/EpzUqvtvZ4c?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="smarter-cloudremote-storage" style="position:relative;">Smarter Cloud/Remote Storage<a href="#smarter-cloudremote-storage" aria-label="smarter cloudremote storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We are committed to building the best data versioning experience. This means
making DVC work with your existing data stack and not trying to replace it. We
have focused on working more closely with cloud storage (and non-cloud storage)
by making DVC not only faster but smarter.</p>
<h3 id="minimizing-downloads" style="position:relative;">Minimizing Downloads<a href="#minimizing-downloads" aria-label="minimizing downloads permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Avoiding unnecessary downloads saves time and space in ways that transfer
speedups alone never could. You can now <a href="https://dvc.org/doc/user-guide/data-management/modifying-large-datasets" target="_blank" rel="nofollow noopener noreferrer">add or modify</a> individual
files in a larger dataset. If you have a large dataset in remote storage, you
can pull and modify any file without needing to download the full dataset.</p>
<p><img src="https://dvc.org/2023-06-14/dvc-part-update-f045f6d718b6d1a25267598a06ef0558.gif" alt="partial-add" title="Add or modify files in a dataset."></p>
<p>You can also run or verify a pipeline <a href="https://dvc.org/doc/user-guide/pipelines/running-pipelines#pull-missing-data" target="_blank" rel="nofollow noopener noreferrer">without pulling data</a> first. You can skip
downloading data for stages that haven't changed and automatically download only
the data needed for stages that have changed.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/CuorzMAUbgU?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h3 id="cloud-versioning" style="position:relative;">Cloud Versioning<a href="#cloud-versioning" aria-label="cloud versioning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You shouldn't have to create extra copies of data that's already backed up and
versioned on the cloud. DVC <a href="https://dvc.org/doc/user-guide/data-management/cloud-versioning" target="_blank" rel="nofollow noopener noreferrer">cloud versioning</a> enables you to import data that's
already versioned by your cloud provider. In the example below, DVC knows not to
push any data to its own storage because it is already versioned by the cloud.
Pulling the data later will recover it from its original source location.</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc import-url</span> <span class="token parameter variable">--version-aware</span> s3://mybucket/data
</span>Importing 's3://mybucket/data' -> 'data'
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span>
</span>Everything is up to date.</code></pre></div>
<h3 id="pythonic-api" style="position:relative;">Pythonic API<a href="#pythonic-api" aria-label="pythonic api permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You may need to work with your cloud data outside of the command-line workflow
of pushing and pulling. The <a href="https://dvc.org/doc/api-reference/dvcfilesystem" target="_blank" rel="nofollow noopener noreferrer">DVCFileSystem</a> API enables you to read and manage
files and directories from remote DVC repos like you would for a local
filesystem. In the example below, each file in the <code>data/prepared</code> directory is
streamed in as text.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token operator">>></span><span class="token operator">></span> <span class="token keyword">from</span> dvc<span class="token punctuation">.</span>api <span class="token keyword">import</span> DVCFileSystem
<span class="token operator">>></span><span class="token operator">></span> repo <span class="token operator">=</span> <span class="token string">"https://github.com/iterative/example-get-started.git"</span>
<span class="token operator">>></span><span class="token operator">></span> fs <span class="token operator">=</span> DVCFileSystem<span class="token punctuation">(</span>repo<span class="token punctuation">,</span> rev<span class="token operator">=</span><span class="token string">"main"</span><span class="token punctuation">)</span>
<span class="token operator">>></span><span class="token operator">></span> <span class="token keyword">for</span> f <span class="token keyword">in</span> fs<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">"data/prepared"</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>     text <span class="token operator">=</span> fs<span class="token punctuation">.</span>read_text<span class="token punctuation">(</span>f<span class="token punctuation">)</span>
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>     <span class="token comment"># process the data</span></code></pre></div>
<h3 id="faster-performance" style="position:relative;">Faster Performance<a href="#faster-performance" aria-label="faster performance permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Sometimes you just need faster performance, especially for large data downloads
and uploads. We have focused on improving performance where it matters most. For
example, pushing data to S3 is 2.5x faster in DVC 3.0 than in early versions of
DVC 2.x according to our benchmarks.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 359px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e6987b7fd311eff2bf6da88e7cc450f0/39600/dvc-push-s3.png" alt="push-s3" title="Time to push to S3." loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h1 id="thank-you" style="position:relative;">Thank You!<a href="#thank-you" aria-label="thank you permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Our constant interaction with the DVC community gives us feedback on what should
be improved. We heard from you that the ML landscape is already complex and you
want to keep your tools simple. That's why many of the new "features" are
improvements to existing functionality, and why we are building this stack of
tools to make DVC easier, more flexible, and a solid choice for your MLOps
workflows.</p>
<p>Finally, none of these improvements would be possible without the support of the
teams who work on the entire DVC stack.</p>
<p>Thanks to all of you who make DVC and its community what it is!</p>
<h1 id="get-started-with-the-dvc-30-stack" style="position:relative;">Get Started with the DVC 3.0 Stack<a href="#get-started-with-the-dvc-30-stack" aria-label="get started with the dvc 30 stack permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Get started with DVC 3.0 or the other tools in the DVC stack:</p>
<ul>
<li><a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">DVC 3.0</a></li>
<li><a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">Studio</a></li>
<li><a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC extension for VS Code</a></li>
</ul>
https://dvc.org/blog/managing-openfoam-physical-simulations-with-dvc-cml-studio-part-2 Wed, 10 May 2023 00:00:00 GMT
<p>In the
<a href="https://iterative.ai/blog/managing-openfoam-physical-simulations-with-dvc-cml-studio-part-1/" target="_blank" rel="nofollow noopener noreferrer">previous post</a>,
we discussed how DVC simplifies physical simulation pipelines and data
management. This post discusses how to run simulations in the cloud, run new
experiments, and visualize simulation results with
<a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a> and other tools.</p>
<p>In this post, you will learn how to:</p>
<ol>
<li>
<p>Manage computational resources on AWS and start and shut down EC2 instances
for simulation experiments.</p>
</li>
<li>
<p>Run new <a href="https://www.openfoam.com/" target="_blank" rel="nofollow noopener noreferrer">OpenFOAM</a> simulations in the cloud using
Iterative Studio and <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML</a>.</p>
</li>
<li>
<p>Use Iterative Studio to view simulation results and DVC plots online.</p>
</li>
</ol>
<p>This post is a result of collaboration between the
<a href="http://iterative.ai" target="_blank" rel="nofollow noopener noreferrer">Iterative.ai</a> and
<a href="https://plasmasolve.com/about-us/" target="_blank" rel="nofollow noopener noreferrer">PlasmaSolve</a> teams. PlasmaSolve was founded
in 2016 by plasma physicists and software engineers to provide a platform for
cutting-edge physics simulation services and research. The PlasmaSolve team
strives to deliver top-notch solutions and well-designed physics simulations to
speed up research and reduce development costs using various open-source and
commercial simulation tools.</p>
<h1 id="run-simulations-in-the-cloud-with-gitlab-and-cml" style="position:relative;">Run simulations in the cloud with GitLab and CML<a href="#run-simulations-in-the-cloud-with-gitlab-and-cml" aria-label="run simulations in the cloud with gitlab and cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<admon type="tip">
<p>For this part of the post, we follow the <code>main</code> branch in the
<a href="https://gitlab.com/iterative.ai/cse_public/sonicfoam-demo/-/tree/main" target="_blank" rel="nofollow noopener noreferrer">demo repository</a>.
Please follow the
<a href="https://gitlab.com/iterative.ai/cse_public/sonicfoam-demo/-/blob/main/README.md" target="_blank" rel="nofollow noopener noreferrer">README</a>
to prepare your environment and install dependencies.</p>
</admon>
<p>OpenFOAM simulations can be computationally intensive, requiring access to
high-performance computing resources or a cluster of computers to solve large or
complex problems.</p>
<p>To run the demo simulation on AWS, we can use
<a href="https://cml.dev/doc" target="_blank" rel="nofollow noopener noreferrer">CML (Continuous Machine Learning)</a>. CML can start a new
AWS EC2 instance to run a new simulation experiment and shut it down when it’s
done.</p>
<p>The full configuration for the demo CI pipeline can be found in the
<a href="https://gitlab.com/iterative.ai/cse_public/sonicfoam-demo/-/blob/main/.gitlab-ci.yml" target="_blank" rel="nofollow noopener noreferrer"><code>.gitlab-ci.yml</code></a>
file.</p>
<p>The demo project shows an example of how to integrate CML into a GitLab CI
configuration. The pipeline has two stages: <code>build</code> and <code>run</code>. The <code>build</code> stage
has a single job that builds a Docker image from the specified <code>Dockerfile</code>,
logs in to Amazon Elastic Container Registry (ECR), and pushes the image to the
registry. The <code>run</code> stage has three jobs: <code>launch</code>, <code>run</code>, and <code>report</code>. The
<code>launch</code> job launches an EC2 instance on Amazon Web Services (AWS), the <code>run</code>
job runs a simulation on that instance, and the <code>report</code> job generates a report on
the simulation results. The diagram below shows the CI pipeline and the AWS
services it uses.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f11f58685a7da2c3a6fa495b06c261d2/39600/architecture.png" alt="CML with GitLab CI configuration" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>CML
with GitLab CI configuration</em></p>
<h2 id="using-aws-computational-resources" style="position:relative;">Using AWS computational resources<a href="#using-aws-computational-resources" aria-label="using aws computational resources permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>When a workflow requires computational resources (such as GPUs), CML can
automatically allocate cloud instances using
<a href="https://cml.dev/doc/ref/runner" target="_blank" rel="nofollow noopener noreferrer">cml runner</a>. You can spin up instances on AWS,
Azure, GCP, or Kubernetes
(<a href="https://cml.dev/doc/self-hosted-runners#cloud-compute-resource-credentials" target="_blank" rel="nofollow noopener noreferrer">see the CML docs on cloud compute credentials</a>).
Alternatively, you can connect to
<a href="https://cml.dev/doc/self-hosted-runners#on-premise-local-runners" target="_blank" rel="nofollow noopener noreferrer">any other computing provider or an on-premise (local) machine</a>.</p>
<p>Below is an example of the GitLab CI <code>launch</code> job configuration that allocates
an AWS instance using the <code>cml runner launch</code> command. Users can set the region, instance
type, and storage size they need:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">launch</span><span class="token punctuation">:</span>
  <span class="token key atrule">stage</span><span class="token punctuation">:</span> run
  <span class="token key atrule">rules</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token key atrule">changes</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>dvc.yaml<span class="token punctuation">,</span> params.yaml<span class="token punctuation">,</span> .gitlab<span class="token punctuation">-</span>ci.yml<span class="token punctuation">]</span>
  <span class="token key atrule">image</span><span class="token punctuation">:</span> iterativeai/cml<span class="token punctuation">:</span>0<span class="token punctuation">-</span>dvc2<span class="token punctuation">-</span>base1
  <span class="token key atrule">script</span><span class="token punctuation">:</span> <span class="token punctuation">></span><span class="token scalar string">
    cml runner launch
    --cloud=aws
    --cloud-region=$AWS_DEFAULT_REGION --cloud-type=m5.2xlarge
    --cloud-hdd-size=32 --labels=cml
    --docker-volumes="/home/.cml/cache:/home/.cml/cache"</span></code></pre></div>
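<p>To make the role of each option explicit, here is a minimal Python sketch that assembles the same <code>cml runner launch</code> call from parameters. The flags are the ones used in the job above; the helper name <code>build_launch_command</code> and the <code>us-east-1</code> region are illustrative assumptions, not part of the demo repository.</p>

```python
import shlex


def build_launch_command(region, instance_type, hdd_size_gb, labels="cml"):
    """Assemble the `cml runner launch` call used in the `launch` job."""
    return [
        "cml", "runner", "launch",
        "--cloud=aws",
        f"--cloud-region={region}",        # where the EC2 instance is created
        f"--cloud-type={instance_type}",   # EC2 instance type
        f"--cloud-hdd-size={hdd_size_gb}", # disk size in GB
        f"--labels={labels}",              # must match the `tags` of the `run` job
        # Cache directory shared by the jobs that run on the instance.
        "--docker-volumes=/home/.cml/cache:/home/.cml/cache",
    ]


# Hypothetical region; the demo reads it from $AWS_DEFAULT_REGION instead.
command = build_launch_command("us-east-1", "m5.2xlarge", 32)
print(shlex.join(command))
```

<p>Changing the instance type or disk size then only means changing one argument, which is the same flexibility the YAML job offers through its flags.</p>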
<h2 id="setup-ci-jobs-to-run-a-simulation" style="position:relative;">Set up CI jobs to run a simulation<a href="#setup-ci-jobs-to-run-a-simulation" aria-label="setup ci jobs to run a simulation permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>To run a new simulation experiment on the instance started by <code>cml runner</code>, we specify the
<code>cml</code> tag in the <code>run</code> job and call the <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> command.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">run</span><span class="token punctuation">:</span>
  <span class="token key atrule">stage</span><span class="token punctuation">:</span> run
  <span class="token key atrule">tags</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>cml<span class="token punctuation">]</span>
  <span class="token key atrule">rules</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token key atrule">changes</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>params.yaml<span class="token punctuation">,</span> .gitlab<span class="token punctuation">-</span>ci.yml<span class="token punctuation">]</span>
  <span class="token key atrule">image</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span>AWS_CONTAINER_IMAGE<span class="token punctuation">}</span>
  <span class="token key atrule">script</span><span class="token punctuation">:</span>
    <span class="token punctuation">...</span>
    <span class="token comment"># Run an experiment</span>
    <span class="token punctuation">-</span> dvc pull <span class="token punctuation">|</span><span class="token punctuation">|</span> echo "Pull failed" <span class="token comment"># Pull outputs of previous simulation if any</span>
    <span class="token punctuation">-</span> dvc exp run <span class="token punctuation">-</span>f
    <span class="token punctuation">-</span> dvc push <span class="token comment"># Save results</span>
    <span class="token punctuation">-</span> rsync <span class="token punctuation">-</span>r ./ /home/.cml/cache/run <span class="token comment"># Share results with 'report' job</span></code></pre></div>
<p>The <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> command downloads the results of previous
experiments from remote storage. By comparing the versions of previous results
and pipeline stage dependencies, DVC can skip stages that do not need to be
rerun, which saves time and computational resources. After the simulation
completes, <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> uploads the new results back to remote storage.</p>
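<p>The idea of skipping stages whose dependencies have not changed can be illustrated with a small Python sketch. This is a simplified model under stated assumptions, not DVC's actual implementation: the helpers <code>stage_is_up_to_date</code> and <code>record_run</code>, the JSON lock file, and the <code>inlet_velocity</code> parameter are all hypothetical.</p>

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path):
    """Return a content hash of the file (SHA-256 here, for illustration)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def stage_is_up_to_date(deps, lock_file):
    """True if every dependency hash matches the one recorded after the last run."""
    lock_path = Path(lock_file)
    if not lock_path.exists():
        return False  # the stage has never run
    recorded = json.loads(lock_path.read_text())
    current = {str(dep): file_hash(dep) for dep in deps}
    return recorded == current


def record_run(deps, lock_file):
    """Store the current dependency hashes after a successful run."""
    hashes = {str(dep): file_hash(dep) for dep in deps}
    Path(lock_file).write_text(json.dumps(hashes))


# Demo with a hypothetical parameters file in a temporary directory.
workdir = Path(tempfile.mkdtemp())
params = workdir / "params.yaml"
lock = workdir / "stage.lock.json"

params.write_text("inlet_velocity: 3.0\n")
first_check = stage_is_up_to_date([params], lock)   # no lock file yet
record_run([params], lock)
second_check = stage_is_up_to_date([params], lock)  # nothing changed
params.write_text("inlet_velocity: 4.0\n")
third_check = stage_is_up_to_date([params], lock)   # parameter changed
print(first_check, second_check, third_check)
```

<p>Only the second check reports an up-to-date stage, which is the case where a rerun could be skipped.</p>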
<p>After the <code>run</code> job completes, the <code>report</code> job prepares and publishes the CML
report to the associated Git commit. To do this, we build a <code>report.md</code>
file with all text and plots in Markdown format, use the <code>cml comment create</code>
command to publish the report, and use <code>cml pr create</code> to open a pull request.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">report</span><span class="token punctuation">:</span>
  <span class="token punctuation">...</span>
  <span class="token key atrule">image</span><span class="token punctuation">:</span> iterativeai/cml<span class="token punctuation">:</span>0<span class="token punctuation">-</span>dvc2<span class="token punctuation">-</span>base1 <span class="token comment"># Python, DVC, & CML pre-installed</span>
  <span class="token key atrule">script</span><span class="token punctuation">:</span>
    <span class="token punctuation">...</span>
    <span class="token comment"># Create CML report</span>
    <span class="token punctuation">-</span> <span class="token punctuation">|</span><span class="token scalar string">
      cat <<EOF > report.md
      ...

      EOF</span>
    <span class="token punctuation">-</span> cml comment create <span class="token punctuation">-</span><span class="token punctuation">-</span>publish<span class="token punctuation">-</span>native report.md
    <span class="token punctuation">-</span> cml pr create .</code></pre></div>
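<p>The contents of <code>report.md</code> are elided in the job above. As a rough sketch of what such a file could contain, the Python snippet below renders a metrics table and embedded plots as Markdown. The helper <code>build_report</code>, the metric names, and the plot path are hypothetical and are not taken from the demo repository.</p>

```python
import tempfile
from pathlib import Path


def build_report(title, metrics, plot_paths):
    """Render a Markdown report with a metrics table and embedded plots."""
    lines = [f"# {title}", "", "| Metric | Value |", "| --- | --- |"]
    for name, value in metrics.items():
        lines.append(f"| {name} | {value} |")
    lines.append("")
    for path in plot_paths:
        # Standard Markdown image syntax: ![alt text](path)
        lines.append(f"![{Path(path).stem}]({path})")
    return "\n".join(lines) + "\n"


report = build_report(
    "Simulation report",
    {"max_pressure": 1.25, "iterations": 400},  # hypothetical metrics
    ["plots/pressure.png"],                     # hypothetical plot path
)
report_file = Path(tempfile.mkdtemp()) / "report.md"
report_file.write_text(report)
print(report)
```

<p>A file produced this way could then be passed to <code>cml comment create</code> in the same way as the heredoc version.</p>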
<p>These reports can help teammates collaborate through a Git
workflow.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/20456cd9f4b2b89b286c5eda678f615b/39600/git_report.png" alt="A report posted after the simulation runs in the pull request" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>A
report posted after the simulation runs in the Pull Request</em></p>
<h2 id="setup-gitlab-ci-variables" style="position:relative;">Set up GitLab CI variables<a href="#setup-gitlab-ci-variables" aria-label="setup gitlab ci variables permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>To run simulations in AWS with GitLab CI & CML, it's recommended to use
provider-managed policies/roles and then explicitly limit the permissions
further if possible.
<a href="https://cml.dev/doc/ref/runner?tab=AWS#common-permissions" target="_blank" rel="nofollow noopener noreferrer">Here is a set of common permissions required by CML</a>.</p>
<p>In this demo we used the following CI variables in the project
<code>Settings → CI/CD → Variables</code>:</p>
<ul>
<li><code>AWS_ACCESS_KEY_ID</code></li>
<li><code>AWS_SECRET_ACCESS_KEY</code></li>
<li><code>AWS_SESSION_TOKEN</code> - it is optional and depends on the AWS organization
settings.</li>
<li><code>REPO_TOKEN</code> - a
<a href="https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html" target="_blank" rel="nofollow noopener noreferrer">personal access token</a>
with the <code>api</code>, <code>read_repository</code> and <code>write_repository</code> scopes. Find more
details in
<a href="https://cml.dev/doc/self-hosted-runners?tab=GitLab#personal-access-token" target="_blank" rel="nofollow noopener noreferrer">CML docs on Personal Access Token</a></li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7fcdd3ce246a70a4c96c8a5f6685807e/39600/ci_vars.png" alt="Examples of CI variables in GitLab" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Examples
of CI variables in GitLab</em></p>
<admon type="tip">
<p>Note: <code>AWS_SESSION_TOKEN</code> is not required for most users. In this demo, it is
specific to Iterative's sandbox account.</p>
</admon>
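<p>A missing variable usually shows up only when a job fails partway through. As a small, hedged sketch, a script like the following could run at the start of a job to report missing variables early. The variable names come from the list above; the helper <code>missing_variables</code> is hypothetical.</p>

```python
import os

# Variables listed above; AWS_SESSION_TOKEN is optional.
REQUIRED_VARS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "REPO_TOKEN"]
OPTIONAL_VARS = ["AWS_SESSION_TOKEN"]


def missing_variables(environ, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


missing = missing_variables(os.environ)
if missing:
    print("Missing CI variables: " + ", ".join(missing))
else:
    print("All required CI variables are set.")
```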
<h1 id="experimenting-and-visualization-simulation-results-in-iterative-studio" style="position:relative;">Experimenting and visualizing simulation results in Iterative Studio<a href="#experimenting-and-visualization-simulation-results-in-iterative-studio" aria-label="experimenting and visualization simulation results in iterative studio permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a> is a web application that you
can access online or even host on-prem. Using the power of leading open-source
tools <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML</a>,
and <a href="https://git-scm.com/" target="_blank" rel="nofollow noopener noreferrer">Git</a>, it enables you to seamlessly manage data, run and
track experiments, and visualize and share results.</p>
<h2 id="run-a-new-simulation" style="position:relative;">Run a new simulation<a href="#run-a-new-simulation" aria-label="run a new simulation permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Using Iterative Studio, we can run new simulation experiments in the cloud and
visualize the results in the Studio UI.</p>
<p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2023-05-10/studio-run-new-simulation-5c1a866efc1c8591b67c3b2463c940e2.mp4" type="video/mp4"> Your
browser does not support the video tag. </video><em>Example of running a new
simulation experiment via Iterative Studio</em></p>
<h2 id="visualize-simulation-results" style="position:relative;">Visualize simulation results<a href="#visualize-simulation-results" aria-label="visualize simulation results permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Iterative Studio displays simulation result images and DVC plots as soon as
the simulation is complete. Studio lets you plot images and metrics
and compare them with previous simulations.</p>
<p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2023-05-10/studio-visualize-simulation-results-5c1a866efc1c8591b67c3b2463c940e2.mp4" type="video/mp4"> Your
browser does not support the video tag. </video><em>Example of visualization of
simulation results in Iterative Studio</em></p>
<h1 id="visualize-the-simulation-outputs-with-paraview" style="position:relative;">Visualize the simulation outputs with ParaView<a href="#visualize-the-simulation-outputs-with-paraview" aria-label="visualize the simulation outputs with paraview permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>OpenFOAM includes several utilities for visualizing simulation results,
including ParaView, a popular open-source visualization tool. With these
tools, users can generate plots, contour plots, and volume renderings of
simulation results.</p>
<p>DVC can help to download the simulation outputs and visualize them locally. A
single command fetches all the data generated by the simulation:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp pull</span></span></code></pre></div>
<p>Downloaded data can be visualized with third-party tools like ParaView.</p>
<p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2023-05-10/ParaView_sonicFoam-388acefd9308949032f41976a4862e26.mp4" type="video/mp4"> Your
browser does not support the video tag. </video> <em>Example of sonicFoam
simulation results visualized in ParaView</em></p>
<h1 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>This post details how Iterative tools help in physical and computational
simulations. For this purpose, we created a demo project built with OpenFOAM.
The demo shows how to set up DVC for simulation experiments and data management.
CML is used in the GitLab CI pipeline to manage computational resources on AWS.
Iterative Studio is then used as a UI to visualize simulation results and run
new simulations in a few clicks.</p>
<p>Overall, DVC, CML, and Iterative Studio can help OpenFOAM users:</p>
<ol>
<li>
<p>Reduce the complexity of simulation pipelines and automate tasks such as
running simulations, post-processing results, and generating reports.</p>
</li>
<li>
<p>Manage and track the data and code associated with OpenFOAM simulations,
and make it easier to reproduce simulation results. Store simulation data
on-premises or in the cloud using a variety of storage types, such as S3.</p>
</li>
<li>
<p>Manage simulation experiments with simple YAML config files.</p>
</li>
<li>
<p>Manage computational resources on AWS and start and shut down EC2 instances
for simulation experiments.</p>
</li>
<li>
<p>Use Iterative Studio as a user-friendly interface for visualizing simulation
results and running new simulations quickly.</p>
</li>
<li>
<p>View and share simulation results and DVC plots online in Iterative Studio,
without the need to download and visualize results locally.</p>
</li>
</ol>
<h1 id="references" style="position:relative;">References<a href="#references" aria-label="references permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<ul>
<li><a href="https://www.simscale.com/blog/openfoam-users-should-try-simscale/" target="_blank" rel="nofollow noopener noreferrer">Why OpenFOAM Users Should Try SimScale</a></li>
<li><a href="https://www.openfoam.com/documentation/tutorial-guide/3-compressible-flow/3.2-supersonic-flow-over-a-forward-facing-step" target="_blank" rel="nofollow noopener noreferrer">OpenFOAM - Tutorial Guide: Supersonic flow over a forward-facing step</a></li>
<li><a href="https://openfoamwiki.net/index.php/ScalarTransportFoam" target="_blank" rel="nofollow noopener noreferrer">Introduction to ScalarTransportFoam solver on OpenFOAMWiki</a></li>
<li><a href="https://develop.openfoam.com/Development/openfoam/-/tree/master/tutorials/basic/scalarTransportFoam" target="_blank" rel="nofollow noopener noreferrer"><code>scalarTransportFoam</code> Tutorial</a></li>
<li><a href="https://www.researchgate.net/profile/Ingo-Riess/post/How_to_model_smoke_propagation_for_an_existing_velocity_field_using_scalarTransportFoam_in_OpenFOAM/attachment/5cee6f723843b0b98254daac/AS%3A763860613099524%401559129970722/download/5-scalarTransportFoamTutorial.pdf" target="_blank" rel="nofollow noopener noreferrer">Walkthrough and tutorial for <code>scalarTransportFoam</code>: a solver for advection-diffusion of a passive scalar</a>,
<em>Eric Paterson, Kevin T. Crofton Department of Aerospace and Ocean
Engineering, Virginia Polytechnic Institute and State University</em></li>
</ul>
https://dvc.org/blog/testing-external-contributions-using-github-actions-secrets Thu, 20 Apr 2023 00:00:00 GMT
<p>As cloud-native applications become more complex and rely on more third-party
services, testing becomes increasingly difficult. One of the most significant
challenges for open source projects is testing contributions against complex
services that require authentication and are particularly hard to mock.</p>
<p>In this blog post, we will explore a simple method for securely running this
kind of integration test on external pull requests, using the GitHub Actions
<a href="https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request_target" target="_blank" rel="nofollow noopener noreferrer"><code>pull_request_target</code> trigger</a>
and GitHub
<a href="https://docs.github.com/en/actions/deployment/targeting-different-environments" target="_blank" rel="nofollow noopener noreferrer">environments</a>
to prevent unauthorized runs.</p>
<h2 id="configuration" style="position:relative;">Configuration<a href="#configuration" aria-label="configuration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ol>
<li>
<p><a href="https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository" target="_blank" rel="nofollow noopener noreferrer">Create some encrypted secrets</a>;
a secret named <strong><code>EXAMPLE</code></strong> will be used to illustrate the next sections.</p>
</li>
<li>
<p><a href="https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment#creating-an-environment" target="_blank" rel="nofollow noopener noreferrer">Create an environment</a>
named <code>external</code> and add some trusted GitHub users or
<a href="https://docs.github.com/en/organizations/organizing-members-into-teams/about-teams" target="_blank" rel="nofollow noopener noreferrer">teams</a>
as
<a href="https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment#required-reviewers" target="_blank" rel="nofollow noopener noreferrer">required reviewers</a>;
they’ll be responsible for approving every run triggered by external
contributors.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d3900aa107f0569453a5d2a73fb3b4d3/03346/environment.jpg" alt="screenshot of environment settings" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
</li>
</ol>
<h2 id="workflow" style="position:relative;">Workflow<a href="#workflow" aria-label="workflow permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<blockquote>
<p>⚠️ <strong>Warning</strong>: using the
<a href="https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request_target" target="_blank" rel="nofollow noopener noreferrer"><code>pull_request_target</code></a>
event without the cautionary measures described below may allow unauthorized
GitHub users to open a “pwn request” and exfiltrate secrets; see also this
[<a href="https://securitylab.github.com/research/github-actions-preventing-pwn-requests" target="_blank" rel="nofollow noopener noreferrer">1</a>,
<a href="https://securitylab.github.com/research/github-actions-untrusted-input" target="_blank" rel="nofollow noopener noreferrer">2</a>,
<a href="https://securitylab.github.com/research/github-actions-building-blocks" target="_blank" rel="nofollow noopener noreferrer">3</a>]
blog post series from GitHub Security Lab and
<a href="https://stackoverflow.com/a/71366152/4654476" target="_blank" rel="nofollow noopener noreferrer">this</a> Stack Overflow answer.</p>
</blockquote>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">on</span><span class="token punctuation">:</span> pull_request_target
<span class="token key atrule">jobs</span><span class="token punctuation">:</span>
  <span class="token key atrule">authorize</span><span class="token punctuation">:</span>
    <span class="token key atrule">environment</span><span class="token punctuation">:</span>
      $<span class="token punctuation">{</span><span class="token punctuation">{</span> github.event_name == 'pull_request_target' <span class="token important">&&</span>
      github.event.pull_request.head.repo.full_name <span class="token tag">!=</span> github.repository <span class="token important">&&</span>
      'external' <span class="token punctuation">|</span><span class="token punctuation">|</span> 'internal' <span class="token punctuation">}</span><span class="token punctuation">}</span>
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> ubuntu<span class="token punctuation">-</span>latest
    <span class="token key atrule">steps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token boolean important">true</span>
  <span class="token key atrule">test</span><span class="token punctuation">:</span>
    <span class="token key atrule">needs</span><span class="token punctuation">:</span> authorize
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> ubuntu<span class="token punctuation">-</span>latest
    <span class="token key atrule">steps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v3
        <span class="token key atrule">with</span><span class="token punctuation">:</span>
          <span class="token key atrule">ref</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> github.event.pull_request.head.sha <span class="token punctuation">|</span><span class="token punctuation">|</span> github.ref <span class="token punctuation">}</span><span class="token punctuation">}</span>
      <span class="token punctuation">-</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> printenv EXAMPLE
        <span class="token key atrule">env</span><span class="token punctuation">:</span>
          <span class="token key atrule">EXAMPLE</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.EXAMPLE <span class="token punctuation">}</span><span class="token punctuation">}</span></code></pre></div>
<p>This workflow will be triggered by the
<a href="https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request_target" target="_blank" rel="nofollow noopener noreferrer"><code>pull_request_target</code></a>
event, which is
<a href="https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#onpushpull_requestpull_request_targetpathspaths-ignore" target="_blank" rel="nofollow noopener noreferrer">similar</a>
to the
<a href="https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request" target="_blank" rel="nofollow noopener noreferrer"><code>pull_request</code></a>
event, but it always passes secrets to workflows triggered from fork pull
requests.</p>
<p>The <code>authorize</code> job checks if the workflow was triggered from a fork pull
request. In that case, the <code>external</code> environment will prevent the job from
running until it’s approved. Otherwise (i.e. when pull requests belong to the
main repository), the job will run without requiring explicit approval.</p>
<p>The <code>test</code> job is where secrets would be used. It
<a href="https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idneeds" target="_blank" rel="nofollow noopener noreferrer"><code>needs</code></a>
the previous job, so it will never start before <code>authorize</code> has finished, which
for fork pull requests means never without explicit approval. The security
of this approach rests on a human approving every such run after
making sure that the pull request contains no malicious code; hence the workflow also overrides
<a href="https://github.com/actions/checkout#checkout-a-different-branch" target="_blank" rel="nofollow noopener noreferrer">the <code>ref</code> from <code>actions/checkout</code></a>
to run on the pull request branch rather than on the main branch.</p>
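<p>The <code>environment</code> expression of the <code>authorize</code> job can be hard to read at
first. As a rough illustration only, its selection logic can be mirrored in a
few lines of Python. The function name and its arguments are hypothetical and
are not part of GitHub Actions or of the workflow itself.</p>

```python
def select_environment(event_name: str, head_repo: str, base_repo: str) -> str:
    """Mirror the `environment:` expression of the `authorize` job.

    Hypothetical helper for illustration; GitHub evaluates the real
    expression itself when the workflow starts.
    """
    # A run is "external" only when it comes from pull_request_target
    # and the head repository differs from the base repository (a fork).
    is_fork_pr = event_name == "pull_request_target" and head_repo != base_repo
    # The workflow's "condition, then 'external', else 'internal'" idiom acts
    # as a ternary because 'external' is a non-empty (truthy) string.
    return "external" if is_fork_pr else "internal"
```

<p>With made-up repository names, <code>select_environment("pull_request_target", "someone/fork", "owner/repo")</code>
selects <code>external</code>, while a pull request whose head and base repositories
match selects <code>internal</code>.</p>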
<h2 id="alternatives" style="position:relative;">Alternatives<a href="#alternatives" aria-label="alternatives permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Admittedly, adding this <code>authorize</code> job to the workflow isn’t particularly
elegant but, as of January 2023, GitHub doesn’t provide any official guidance on
how to achieve a similar result in simpler ways.</p>
<ul>
<li>In 2020, GitHub
<a href="https://github.blog/2020-08-03-github-actions-improvements-for-fork-and-pull-request-workflows/" target="_blank" rel="nofollow noopener noreferrer">introduced</a>
an option to send secrets to workflows from fork pull requests, but it only
applies to fork pull requests from private repositories.</li>
<li>In 2021, GitHub
<a href="https://github.blog/2021-04-22-github-actions-update-helping-maintainers-combat-bad-actors/" target="_blank" rel="nofollow noopener noreferrer">introduced</a>
an option to
<a href="https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/enabling-features-for-your-repository/managing-github-actions-settings-for-a-repository#configuring-required-approval-for-workflows-from-public-forks" target="_blank" rel="nofollow noopener noreferrer">require approval for all the outside collaborators</a>,
but the <code>pull_request_target</code> event will trigger
<a href="https://docs.github.com/en/enterprise-cloud@latest/actions/managing-workflow-runs/approving-workflow-runs-from-public-forks#about-workflow-runs-from-public-forks" target="_blank" rel="nofollow noopener noreferrer">regardless of the approval settings</a>.</li>
</ul>
<p>Other common alternatives include: skipping tests that need access to secrets,
disabling forks, and using pull request labels or code review approvals to
control the execution of tests.</p>
<h2 id="security-testing" style="position:relative;">Security Testing<a href="#security-testing" aria-label="security testing permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This approach has occasionally been probed by security researchers who found our
repositories while looking for the <code>pull_request_target</code> trigger, but none of
them (<a href="https://github.com/iterative/cml/pull/1130" target="_blank" rel="nofollow noopener noreferrer">#1130</a>
[<a href="https://marcyoung.us/post/zuckerpunch" target="_blank" rel="nofollow noopener noreferrer">1</a>] &
<a href="https://github.com/iterative/cml/pull/1322" target="_blank" rel="nofollow noopener noreferrer">#1322</a>) were able to bypass this
protection. If you find a way to bypass it, please feel free to put
<a href="https://iterative.ai/security-and-privacy/" target="_blank" rel="nofollow noopener noreferrer">our bug bounty program</a> to good
use.</p>
<hr>
<p>There you have it! As far as we know, this is currently the most elegant GitHub
Actions configuration for testing pull requests from public repository forks
using secrets. As maintainers of a lot of open source software, we hold this
topic close to our hearts!</p>
<p>Here are some example usages for
<a href="https://github.com/iterative/cml/blob/1be24edaa817de320a657ec3ad1182e145aecef7/.github/workflows/test-deploy.yml#L13-L20" target="_blank" rel="nofollow noopener noreferrer">cml</a>
and
<a href="https://github.com/iterative/mlem/blob/462384ee7a9fc50196e06942684171e9915f46ae/.github/workflows/check-test-release.yml#L13-L25" target="_blank" rel="nofollow noopener noreferrer">mlem</a>.</p>
<p><em>Do you have any better alternative or maybe a similar use case and want to
discuss more? Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>https://dvc.org/blog/managing-openfoam-physical-simulations-with-dvc-cml-studio-part-1https://dvc.org/blog/managing-openfoam-physical-simulations-with-dvc-cml-studio-part-1Mon, 17 Apr 2023 00:00:00 GMT<h1 id="introduction" style="position:relative;">Introduction<a href="#introduction" aria-label="introduction permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><a href="https://www.openfoam.com/" target="_blank" rel="nofollow noopener noreferrer">OpenFOAM</a> is a powerful, open-source software tool
used for
<a href="https://en.wikipedia.org/wiki/Computational_fluid_dynamics" target="_blank" rel="nofollow noopener noreferrer">computational fluid dynamics</a>
(CFD) simulations. It allows engineers and scientists to model and analyze the
flow of fluids, such as gases and liquids, through intricate geometries,
together with physical phenomena such as turbulence, heat transfer, and
chemical reactions. OpenFOAM has a large and dedicated user
base and is used in a variety of industries, including aerospace,
automotive, chemical, energy, and marine engineering.</p>
<p>This post focuses on the following challenges that users of OpenFOAM may
encounter:</p>
<ol>
<li>
<p><strong>Complexity:</strong> OpenFOAM is a highly flexible and powerful tool, but this can
also make it difficult for new users to learn and navigate. The software has
a large number of solvers and utilities, and it can be challenging to
understand which solver is most suitable for a given problem.</p>
</li>
<li>
<p><strong>Data management:</strong> OpenFOAM simulations generate a number of outputs that
need to be stored, versioned, shared, and cleaned up when needed.</p>
</li>
<li>
<p><strong>Interfacing with other software:</strong> OpenFOAM may need to be used in
conjunction with other software, such as CAD or mesh generation tools, and
there can be challenges in integrating these tools and transferring data
between them.</p>
</li>
<li>
<p><strong>Software version control:</strong> OpenFOAM and other simulation packages are
complex and frequently updated, so the versions used for each run need to be tracked.</p>
</li>
</ol>
<p>All of these challenges are harder for a small team of researchers who
develop and run simulations and who may lack experience with DevOps and cloud
infrastructure management. Such a team needs a handy toolset that helps
with pipelines and infrastructure setup.</p>
<p>With DVC, you can version simulation outputs and pipelines and control the
software versions used to execute each pipeline, which keeps results consistent.
These features let users check that a new version of the software
produces the same results as previous versions, helping to maintain the
reliability and accuracy of the simulations. <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML</a> and
<a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a> together help with managing
cloud resources, running new experiments through a UI, and displaying the
parameters and results of each simulation.</p>
<p>We describe these and other features in a series of two posts. In this post,
we discuss how Iterative tools help with physical and computational simulations.
To do this, we’ll go over a simple demo project built with OpenFOAM. The demo
shows how to set up DVC for simulation experiments and data management.</p>
<p>These posts are a result of collaboration between the
<a href="http://iterative.ai" target="_blank" rel="nofollow noopener noreferrer">Iterative.ai</a> and
<a href="https://plasmasolve.com/about-us/" target="_blank" rel="nofollow noopener noreferrer">PlasmaSolve</a> teams. PlasmaSolve was founded
in 2016 by plasma physicists and software engineers to provide a platform for
cutting-edge physics simulation services and research. The PlasmaSolve team
strives to deliver top-notch solutions and well-designed physics simulations to
speed up research and reduce development costs using various open-source and
commercial simulation tools.</p>
<p><strong>In this post, you will learn how to:</strong></p>
<ol>
<li>
<p>Configure and run OpenFOAM simulations with DVC</p>
</li>
<li>
<p>Store and share simulation data in the cloud using DVC</p>
</li>
</ol>
<h1 id="sonicfoam-simulation-pipeline" style="position:relative;"><code>sonicFoam</code> simulation pipeline<a href="#sonicfoam-simulation-pipeline" aria-label="sonicfoam simulation pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>OpenFOAM simulations may include several computational steps: mesh
generation, one or more solver runs, and post-processing of the results.
<code>sonicFoam</code> is an OpenFOAM solver for transient trans-sonic and supersonic
flows of a compressible gas.</p>
<p>In this demo, we simulate a supersonic flow over a forward-facing step.
The scenario involves a Mach 3 flow entering a rectangular area with a
step near the inlet, which creates shock waves. We use the same geometry to run
two chained simulations: <code>sonicFoam</code> and <code>scalarTransportFoam</code>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 531px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/988022d67d2d773683b934528aaf264f/39600/shock_fronts.png" alt="Shock fronts in the forward step problem" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Shock
fronts in the forward step problem
<a href="https://www.openfoam.com/documentation/tutorial-guide/3-compressible-flow/3.2-supersonic-flow-over-a-forward-facing-step" target="_blank" rel="nofollow noopener noreferrer">(source)</a></em></p>
<p>Our demo simulation pipeline contains a few steps:</p>
<ol>
<li>
<p>Generate geometry with <code>blockMesh</code>;</p>
</li>
<li>
<p>Run <code>sonicFoam</code> simulation to get velocity (<code>U</code>) and temperature (<code>T</code>)
fields;</p>
</li>
<li>
<p>Post-process the simulation results;</p>
</li>
<li>
<p>Run a subsequent <code>scalarTransportFoam</code> simulation that uses the velocity
field computed before.</p>
</li>
</ol>
<p>In practice, simulations sometimes need to be “chained”, i.e. the outputs of one
simulation serve as inputs to another. When running a parametric study
of such a simulation chain, intermediate simulations are often recomputed even
if the parameter change does not influence them. We demonstrate how to use DVC
to cache all the results and trigger a computation only when it is really necessary.
In this demo, the results of the <code>sonicFoam</code> solver serve as inputs to the
<code>scalarTransportFoam</code> solver.</p>
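<p>The idea of recomputing a stage only when its inputs change can be
illustrated with a small Python sketch. This is a toy model of input-based
caching, not DVC's actual implementation; the function names and the in-memory
<code>cache</code> dictionary are hypothetical.</p>

```python
import hashlib
import json


def fingerprint(params: dict) -> str:
    """Hash a stage's inputs so that any change yields a new fingerprint."""
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_stage(name, params, cache, compute):
    """Run `compute(params)` only if this stage has not seen these inputs.

    `cache` is a plain dict standing in for a persistent result cache.
    """
    key = (name, fingerprint(params))
    if key not in cache:
        # Inputs are new (or changed): do the expensive computation once.
        cache[key] = compute(params)
    return cache[key]
```

<p>Calling <code>run_stage</code> twice with the same stage name and parameters runs
the computation once; changing a parameter produces a new fingerprint and
triggers a new run.</p>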
<p>As a basis for the demo, we use the OpenFOAM
<a href="https://www.openfoam.com/documentation/tutorial-guide/3-compressible-flow/3.2-supersonic-flow-over-a-forward-facing-step" target="_blank" rel="nofollow noopener noreferrer">Supersonic flow over a forward-facing step tutorial</a>.
The original code can be found
<a href="https://develop.openfoam.com/Development/openfoam/tree/master/tutorials/compressible/sonicFoam/laminar/forwardStep" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<h3 id="setup-the-demo-project" style="position:relative;">Setup the demo project<a href="#setup-the-demo-project" aria-label="setup the demo project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>💡 For this part of the post, we follow the <code>no-dvc</code> branch in the
<a href="https://gitlab.com/iterative.ai/cse_public/sonicfoam-demo/-/tree/no-dvc" target="_blank" rel="nofollow noopener noreferrer">demo repository</a>.</p>
<p>The easiest way to follow the OpenFOAM demo is to run it in
<a href="https://www.docker.com/" target="_blank" rel="nofollow noopener noreferrer">Docker</a> containers. Follow the setup section in the
repository <code>README</code> to build a Docker image, set up a Python virtual
environment, and install dependencies.</p>
<p>After the environment is set up, we only need to run the <code>openfoam-cse-docker</code> script,
which runs a new OpenFOAM job in a Docker container. For example, to run
OpenFOAM interactively, use the command:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line">$ ./openfoam-cse-docker</span></code></pre></div>
<h2 id="1-generate-geometry-with-blockmesh" style="position:relative;">1. Generate geometry with <code>blockMesh</code><a href="#1-generate-geometry-with-blockmesh" aria-label="1 generate geometry with blockmesh permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>To use <code>sonicFoam</code>, a user must first create a 3D geometry model of the flow
domain using a tool such as CAD software. The user must then define the boundary
conditions and physical properties of the flow, such as the temperature,
pressure, and velocity at each boundary. The user can then run the simulation
using the <code>sonicFoam</code> solver, which solves the governing equations of
compressible flow using the finite volume method. In this demo, the mesh is
generated with the <code>blockMesh</code> utility:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line">$ ./openfoam-cse-docker <span class="token parameter variable">-c</span> <span class="token string">'cd sonicFoam && blockMesh'</span></span></code></pre></div>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 460px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/8604d3e8996de7144a3e3553f04fecff/39600/forward_step_geometry.png" alt="Geometry of the forward step" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Geometry
of the forward step
<a href="https://www.openfoam.com/documentation/tutorial-guide/3-compressible-flow/3.2-supersonic-flow-over-a-forward-facing-step" target="_blank" rel="nofollow noopener noreferrer">(source)</a></em></p>
<h2 id="2-run-the-first-step-simulation-with-sonicfoam-solver" style="position:relative;">2. Run the first step simulation with <code>sonicFoam</code> solver<a href="#2-run-the-first-step-simulation-with-sonicfoam-solver" aria-label="2 run the first step simulation with sonicfoam solver permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>During the simulation, <code>sonicFoam</code> will calculate various flow quantities, such
as the pressure, velocity, and temperature, at each point in the flow domain.
The user can then visualize and analyze these results using post-processing
tools, such as ParaView, to gain insight into the flow behavior.</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line">$ ./openfoam-cse-docker <span class="token parameter variable">-c</span> <span class="token string">'cd sonicFoam && sonicFoam'</span></span></code></pre></div>
<h2 id="3-post-processing-simulation-results" style="position:relative;">3. Post-processing simulation results<a href="#3-post-processing-simulation-results" aria-label="3 post processing simulation results permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>As an example of post-processing stages in the simulation demo, we have a few
tasks:</p>
<ul>
<li>
<p>calculate the magnitude of the velocity</p>
</li>
<li>
<p>calculate <code>flowRatePatch</code></p>
</li>
<li>
<p>generate VTK and visualize mesh</p>
</li>
</ul>
<p><strong>Calculate the magnitude of the velocity</strong></p>
<p><code>postProcess</code> is a command that allows users to perform post-processing operations on
simulation data. The <code>-func</code> option specifies the function
that should be applied to the data. In this case, it calculates the magnitude
of the velocity field and writes it into a file named <code>mag(U)</code> in each time directory
generated during the simulation:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line">$ ./openfoam-cse-docker <span class="token parameter variable">-c</span> <span class="token string">'cd sonicFoam && postProcess -func "mag(U)"'</span></span></code></pre></div>
<p>The <code>postProcess</code> command can be used in conjunction with various options and
functions to perform a wide range of post-processing tasks, such as calculating
flow quantities, generating plots, and creating animations. It is an important
tool for gaining insight into the results of CFD simulations.</p>
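<p>For reference, the magnitude of a velocity vector is the Euclidean norm of
its components. A minimal Python sketch of that per-cell calculation follows;
the function name and the list-of-tuples input are illustrative assumptions and
do not reflect how OpenFOAM stores or processes fields internally.</p>

```python
import math


def velocity_magnitude(u):
    """Return |U| for each (Ux, Uy, Uz) vector in `u`."""
    return [math.sqrt(ux * ux + uy * uy + uz * uz) for ux, uy, uz in u]
```

<p>For example, a cell with velocity <code>(3.0, 4.0, 0.0)</code> has a magnitude of
<code>5.0</code>.</p>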
<p><strong>Calculate <code>flowRatePatch</code></strong></p>
<p>To produce a 1D dataset and its visualization, we compute the flow rate
over the “outlet” patch by applying the
<code>flowRatePatch(name=outlet)</code> function to the simulation data. The
<code>flowRatePatch</code> function calculates the flow rate through a patch, which is a
specified boundary in the flow domain. The <code>name</code> argument selects the patch,
in this case <code>outlet</code>, the boundary at the outlet of the flow domain.</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line">$ ./openfoam-cse-docker <span class="token parameter variable">-c</span> <span class="token string">'cd sonicFoam && \
postProcess -func "flowRatePatch(name=outlet)"'</span></span></code></pre></div>
<p>This operation saves results into the
<code>sonicFoam/postProcessing/flowRatePatch(name=outlet)/0/surfaceFieldValue.dat</code>
file.</p>
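<p>To turn that file into a dataset for plotting, it has to be parsed first.
The sketch below assumes the file is plain text with header lines starting
with <code>#</code>, followed by whitespace-separated rows whose first column is the
time and whose second column is the flow rate. Check this against your own
output, as the layout may differ between OpenFOAM versions. The function name
is hypothetical.</p>

```python
def parse_flow_rate(text):
    """Parse (time, flow_rate) pairs from the text of a .dat file.

    Assumes '#'-prefixed header lines and whitespace-separated columns,
    with time in the first column and the flow rate in the second.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        time_str, value_str = line.split()[:2]
        rows.append((float(time_str), float(value_str)))
    return rows
```

<p>The resulting list of pairs can be passed to any plotting library to draw
the flow rate over time.</p>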
<p><strong>Generate VTK</strong></p>
<p><code>foamToVTK</code> is a utility that converts simulation data stored in the OpenFOAM format
to the VTK (<a href="https://vtk.org/about/#overview" target="_blank" rel="nofollow noopener noreferrer">Visualization ToolKit</a>) format.
VTK is a popular file format for storing and visualizing scientific data, and it
is often used for post-processing and visualization of CFD simulations.</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line">$ ./openfoam-cse-docker <span class="token parameter variable">-c</span> <span class="token string">'cd sonicFoam && foamToVTK'</span></span></code></pre></div>
<p>This converts the simulation data stored in the <code>sonicFoam</code> directory from
the OpenFOAM format to the VTK format, allowing it to be visualized and analyzed
using tools that support the VTK format. It creates the <code>sonicFoam/VTK/</code> directory
with the converted simulation results.</p>
<h2 id="4-visualize-simulation-results" style="position:relative;">4. Visualize simulation results<a href="#4-visualize-simulation-results" aria-label="4 visualize simulation results permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>To visualize the results of the <code>sonicFoam</code> simulation, you can use the
tools included with OpenFOAM: <code>paraFoam</code> opens the case in ParaView, while
<code>foamToVTK</code> converts the results so that they can be opened in any viewer
that supports the VTK format.</p>
<p>In the demo example, a 3D geometry mesh and a float pressure diagram are
generated. Examples of the generated files are shown below.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 532px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/98d6462c2a187d6cb6c1006d3b1fc196/39600/3d_mesh_viz.png" alt="3D mesh visualization" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>3D mesh
visualization</em></p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/461f191c90cc88d5c5ff70947230537c/39600/float_pressure_diag.png" alt="Float pressure diagram" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Float
pressure diagram</em></p>
<h2 id="5-run-the-second-step-simulation-with-scalartransportfoam-solver" style="position:relative;">5. Run the second step simulation with <code>scalarTransportFoam</code> solver<a href="#5-run-the-second-step-simulation-with-scalartransportfoam-solver" aria-label="5 run the second step simulation with scalartransportfoam solver permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><code>scalarTransportFoam</code> is an OpenFOAM solver
that solves a transport equation for a passive scalar using a
specified stationary velocity field. It is typically used to calculate the
convection and diffusion of a scalar in a given velocity field.</p>
<p>Before running the <code>scalarTransportFoam</code> solver, we need to update the stage
configuration based on the <code>sonicFoam</code> outputs:</p>
<ul>
<li>
<p>Copy <code>U</code> config from the last simulation stage in <code>sonicFoam</code></p>
</li>
<li>
<p>Update <code>T</code> config with the <code>boundaryField</code> from the last simulation stage in
<code>sonicFoam</code></p>
</li>
<li>
<p>Copy the <code>polyMesh</code> to use the same geometry</p>
</li>
</ul>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token comment"># Configure scalarTransportFoam</span>
<span class="token line"><span class="token input">$ </span><span class="token command">python3</span> src/config_scalarTransportFoam.py
</span>
<span class="token comment"># Run scalarTransportFoam simulation</span>
<span class="token line">$ ./openfoam-cse-docker <span class="token parameter variable">-c</span> <span class="token string">'cd scalarTransportFoam && scalarTransportFoam'</span></span></code></pre></div>
<p>The simulation will calculate the transport of the passive scalar using the
specified velocity field and other input parameters. The resulting simulation
data can then be post-processed and analyzed to gain insight into the transport
of the scalar in the flow.</p>
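<p>The configuration step that precedes this run has to locate the final
<code>sonicFoam</code> results. OpenFOAM writes results into directories named after the
simulation time, so one way to find them is to pick the numerically largest
directory name. The sketch below illustrates that idea only. It is not the
demo's <code>src/config_scalarTransportFoam.py</code>, and the function name is
hypothetical.</p>

```python
from pathlib import Path


def latest_time_dir(case_dir):
    """Return the numerically largest time directory inside an OpenFOAM case."""
    candidates = []
    for entry in Path(case_dir).iterdir():
        if not entry.is_dir():
            continue
        try:
            candidates.append((float(entry.name), entry))
        except ValueError:
            continue  # skip non-time directories such as constant/ or system/
    if not candidates:
        raise FileNotFoundError(f"no time directories in {case_dir}")
    # Compare by numeric value, so that "10" sorts after "2".
    return max(candidates, key=lambda item: item[0])[1]
```

<p>Comparing names as numbers rather than as strings matters here: a plain
string sort would place <code>10</code> before <code>2</code>.</p>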
<h1 id="reduce-simulation-management-complexity-with-dvc" style="position:relative;">Reduce simulation management complexity with DVC<a href="#reduce-simulation-management-complexity-with-dvc" aria-label="reduce simulation management complexity with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>💡 For this part of the post, we follow the <code>main</code> branch in the
<a href="https://gitlab.com/iterative.ai/cse_public/sonicfoam-demo/-/tree/main" target="_blank" rel="nofollow noopener noreferrer">demo repository</a>.
Please follow the README to prepare your environment and install dependencies.</p>
<p>Up to this point, we have run the different tasks of the simulation pipeline using
separate commands. Let’s see how DVC can help automate the
simulation pipeline and handle simulation output data.</p>
<p>Pipelines are a feature of the <a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a> (Data Version Control)
tool. A DVC pipeline is a series of commands that are executed in a specific
order and can be used to run all the steps that are needed: the simulation itself,
post-processing of the results, and report generation. DVC automatically captures
and tracks the data and code associated with your OpenFOAM simulations to make
them reproducible and shareable with your team.</p>
<h2 id="basic-computational-stage-configuration" style="position:relative;">Basic computational stage configuration<a href="#basic-computational-stage-configuration" aria-label="basic computational stage configuration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>A DVC pipeline file (<code>dvc.yaml</code>) is written in YAML format and consists of a list of steps (stages),
each of which corresponds to a command that should be executed as part of the
pipeline. The steps can depend on one another, meaning that the output from one
step is used as input for another step. More details can be found in the
<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#stage-entries" target="_blank" rel="nofollow noopener noreferrer">DVC documentation</a>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ba275bdbf4ec6069aa0f53d531f8bdbc/39600/dag.png" alt="DVC DAG" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Let’s consider an example of the DVC pipeline configuration for the <code>blockMesh</code>
stage below.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">blockMesh</span><span class="token punctuation">:</span>
  <span class="token key atrule">cmd</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> bash run.sh 'cd sonicFoam <span class="token important">&&</span> blockMesh'
  <span class="token key atrule">deps</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> sonicFoam/system/blockMeshDict
  <span class="token key atrule">outs</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> sonicFoam/constant/polyMesh</code></pre></div>
<p>The <code>cmd</code> field specifies the command to be executed, which in this case is a
utility shell script <code>run.sh</code> that changes the file permissions and runs the
<code>blockMesh</code> command directly or through the <code>openfoam-cse-docker</code> script. The <code>run.sh</code>
script “knows” how to run the simulation pipeline in your local environment
(manually) or as part of the GitLab CI pipeline in the cloud environment
(automatically). We will discuss CI configuration in later sections.</p>
<p>The <code>deps</code> field in this pipeline step specifies the input files that the
<code>blockMesh</code> command depends on, in this case the <code>blockMeshDict</code> file. This file contains
information about the mesh and the simulation parameters, and is required by
the <code>blockMesh</code> command to generate the mesh.</p>
<p>The <code>outs</code> field specifies the output files generated by the <code>blockMesh</code>
command. In this case, the output is the <code>polyMesh</code> directory, which contains
the generated mesh data. The mesh data is captured and versioned by DVC.</p>
<h2 id="configure-simulation-pipelines-with-paramsyaml" style="position:relative;">Configure simulation pipelines with <code>params.yaml</code><a href="#configure-simulation-pipelines-with-paramsyaml" aria-label="configure simulation pipelines with paramsyaml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The DVC parameters file (<code>params.yaml</code>) configures an OpenFOAM
simulation. Here is an extract of the parameters used for the <code>sonicFoam</code> stage
configuration:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">configureSim</span><span class="token punctuation">:</span>
  <span class="token key atrule">sim_config_dir</span><span class="token punctuation">:</span> configs
  <span class="token key atrule">controlDict</span><span class="token punctuation">:</span>
    <span class="token key atrule">path</span><span class="token punctuation">:</span> system/controlDict
    <span class="token key atrule">params</span><span class="token punctuation">:</span>
      <span class="token key atrule">startTime</span><span class="token punctuation">:</span> <span class="token number">0</span>
      <span class="token key atrule">endTime</span><span class="token punctuation">:</span> <span class="token number">3</span>
      <span class="token key atrule">deltaT</span><span class="token punctuation">:</span> <span class="token number">0.002</span>
      <span class="token key atrule">writeInterval</span><span class="token punctuation">:</span> <span class="token number">0.5</span>
      <span class="token key atrule">purgeWrite</span><span class="token punctuation">:</span> <span class="token number">0</span>
      <span class="token key atrule">writePrecision</span><span class="token punctuation">:</span> <span class="token number">5</span>
      <span class="token key atrule">timePrecision</span><span class="token punctuation">:</span> <span class="token number">6</span></code></pre></div>
<p>The <code>params</code> field of the <code>controlDict</code> section specifies the values of the
simulation control parameters. In this case, the <code>startTime</code>, <code>endTime</code>,
<code>deltaT</code>, <code>writeInterval</code>, <code>purgeWrite</code>, <code>writePrecision</code>, and <code>timePrecision</code>
parameters are set to specific values.</p>
<p>In the DVC simulation setup, the user is responsible for putting the values from
the <code>params.yaml</code> file into the <code>controlDict</code>. Unlike other tools that handle
this process automatically, this approach requires some manual effort on the
user's end but provides greater flexibility as it eliminates the need for
support for each and every tool or software used in the simulation. The demo
showcases how this task is carried out through the <code>src/configureSim.py</code> script.</p>
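<p>To make the idea concrete, here is a minimal sketch of copying values from <code>params.yaml</code> into a <code>controlDict</code>. It is not the demo's <code>src/configureSim.py</code>: the function name <code>apply_params</code> and the sample <code>controlDict</code> fragment are hypothetical, and it assumes the simple <code>key value;</code> entry layout used by OpenFOAM dictionaries.</p>

```python
import re


def apply_params(control_dict_text: str, params: dict) -> str:
    """Return controlDict text with `key value;` entries replaced from params."""
    updated = control_dict_text
    for key, value in params.items():
        # Match one whole "key   value;" entry, keeping the original indentation.
        pattern = re.compile(
            rf"^([ \t]*){re.escape(key)}[ \t]+[^;]*;", flags=re.MULTILINE
        )
        updated, count = pattern.subn(
            lambda m, k=key, v=value: f"{m.group(1)}{k} {v};", updated
        )
        if count == 0:
            raise KeyError(f"{key} not found in controlDict")
    return updated


# Hypothetical controlDict fragment; in practice the text would be read from
# the file named in `controlDict.path` and the values from params.yaml.
template = "startTime       0;\nendTime         1;\ndeltaT          0.001;\n"
params = {"startTime": 0, "endTime": 3, "deltaT": 0.002}
result = apply_params(template, params)
print(result)
```

<p>Raising an error for a missing key is a design choice in this sketch: a parameter that silently fails to reach the solver is harder to debug than a failed stage.</p>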
<h2 id="adapt-dvc-behavior-for-the-simulation-use-case" style="position:relative;">Adapt DVC behavior for the simulation use case<a href="#adapt-dvc-behavior-for-the-simulation-use-case" aria-label="adapt dvc behavior for the simulation use case permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>DVC pipeline configuration expects that all inputs and outputs of each stage are
explicitly defined in the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. This is a common pattern in Machine
Learning and Data Management pipelines. DVC uses explicit <code>deps</code> and <code>outs</code> to
build a computational DAG and “understand” whether it needs to re-run a stage if
some of its dependencies change. This ensures the reproducibility of the
pipeline.</p>
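<p>The change detection can be illustrated with a small sketch. DVC records content hashes of dependencies and outputs in <code>dvc.lock</code>; the code below only mimics that idea and is not DVC's implementation. The helper names <code>file_md5</code> and <code>stage_changed</code> and the in-memory <code>recorded</code> dictionary are hypothetical.</p>

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Content hash of a single dependency file."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def stage_changed(deps: list, recorded: dict) -> bool:
    """True if any dependency hash differs from the recorded one."""
    return any(recorded.get(str(p)) != file_md5(p) for p in deps)


with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "blockMeshDict"
    dep.write_text("cells 10;")
    recorded = {str(dep): file_md5(dep)}  # stand-in for what a lock file stores
    unchanged = stage_changed([dep], recorded)  # nothing edited yet
    dep.write_text("cells 20;")  # edit the dependency
    changed = stage_changed([dep], recorded)
    print(unchanged, changed)
```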
<p>However, OpenFOAM simulation pipelines are different. Depending on the
simulation parameters (e.g. <code>endTime</code> and <code>writeInterval</code> in the <code>controlDict</code>
parameters), a different number of files and folders can be generated.
Therefore, it may be impossible to specify all outputs in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> in advance.
But because these files are not specified in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, DVC can’t manage
them properly. To solve this problem, we introduced two helper scripts that
“help” DVC find and handle the files and folders generated in the simulation use
case. Hopefully,
<a href="https://github.com/iterative/dvc/issues/4816" target="_blank" rel="nofollow noopener noreferrer">supporting wildcard patterns</a> in the
<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> configuration file will simplify such use cases!</p>
<p>Let’s introduce two additional helper scripts:</p>
<ul>
<li><code>dvc_outs_remove.py</code> - removes the stage outputs from the previous simulation.
This script checks whether there are files previously added by the
<code>dvc_outs_handler.py</code> script and removes them from DVC with the <a href="https://dvc.org/doc/command-reference/remove"><code>dvc remove</code></a>
command.</li>
<li><code>dvc_outs_handler.py</code> - finds all “untracked” files and adds them to DVC control. By
default, only files tracked by either Git or DVC are saved to the experiment.
This script checks whether there are files or directories generated by the stage
and adds them to DVC with the <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> command.</li>
</ul>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">sonicFoam</span><span class="token punctuation">:</span>
  <span class="token key atrule">cmd</span><span class="token punctuation">:</span>
    <span class="token comment"># Remove previous sim results</span>
    <span class="token punctuation">-</span> python3 src/dvc_outs_remove.py <span class="token punctuation">-</span><span class="token punctuation">-</span>stage=sonicFoam <span class="token punctuation">...</span>
    <span class="token comment"># Run sim</span>
    <span class="token punctuation">-</span> bash run.sh 'cd sonicFoam <span class="token important">&&</span> sonicFoam'
    <span class="token comment"># Add generated files to DVC and create outputs index files</span>
    <span class="token punctuation">-</span> python3 src/dvc_outs_handler.py <span class="token punctuation">-</span><span class="token punctuation">-</span>stage=sonicFoam <span class="token punctuation">...</span>
  <span class="token key atrule">params</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> configureSim
  <span class="token key atrule">deps</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> sonicFoam/constant/polyMesh/
    <span class="token punctuation">-</span> <span class="token punctuation">...</span>
  <span class="token key atrule">outs</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token punctuation">...</span></code></pre></div>
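<p>The core of the "find what the solver generated" step can be sketched as a directory snapshot comparison. This is only an illustration under stated assumptions, not the demo's <code>dvc_outs_handler.py</code>: the function <code>find_new_outputs</code> is hypothetical, and empty directories stand in for the time directories a solver would write.</p>

```python
import tempfile
from pathlib import Path


def find_new_outputs(case_dir: Path, known: set) -> list:
    """Names of top-level entries in case_dir that were not present before."""
    return sorted(p.name for p in case_dir.iterdir() if p.name not in known)


with tempfile.TemporaryDirectory() as tmp:
    case = Path(tmp)
    for name in ("constant", "system", "0"):
        (case / name).mkdir()
    known = {p.name for p in case.iterdir()}  # snapshot before the solver runs
    for name in ("0.5", "1", "1.5"):  # stand-ins for generated time directories
        (case / name).mkdir()
    new_outputs = find_new_outputs(case, known)
    print(new_outputs)
```

<p>In a real stage, each discovered path would then be handed to <code>dvc add</code>, as described above.</p>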
<h2 id="link-stages-and-multiple-solvers" style="position:relative;">Link stages and multiple solvers<a href="#link-stages-and-multiple-solvers" aria-label="link stages and multiple solvers permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>It is common for OpenFOAM simulations to involve complex pipelines with multiple
steps and dependencies between the steps. This is because simulations often
require the use of multiple solvers, each of which may have its own input and
output files and dependencies on other solvers.</p>
<p>For example, a simulation may require the use of multiple solvers to simulate
different physical phenomena, such as fluid flow, heat transfer, and chemical
reactions. These solvers may need to be run in a specific order and may depend
on the output of other solvers as input.</p>
<p>It’s possible to manage these dependencies with DVC! DVC allows you to specify
the steps in the simulation pipeline and the dependencies between them in a
configuration file.</p>
<p>The demo project has two solvers: <code>sonicFoam</code> and <code>scalarTransportFoam</code>.
Both solvers depend on the same geometry generated by the <code>blockMesh</code> stage. If
we know the exact path to the output (<code>outs</code>) of the <code>sonicFoam</code>
solver, we can explicitly define it as a dependency (<code>deps</code>) of the
<code>scalarTransportFoam</code> stage. In our case, we use a utility script
(<code>src/config_scalarTransportFoam.py</code>) to get the results of the <code>sonicFoam</code>
solver and prepare the initial state for the <code>scalarTransportFoam</code> solver.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">scalarTransportFoam</span><span class="token punctuation">:</span>
  <span class="token key atrule">cmd</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> python3 src/config_scalarTransportFoam.py
    <span class="token punctuation">-</span> <span class="token punctuation">...</span>
    <span class="token punctuation">-</span> bash run.sh 'cd scalarTransportFoam <span class="token important">&&</span> scalarTransportFoam'
    <span class="token punctuation">-</span> <span class="token punctuation">...</span>
  <span class="token key atrule">deps</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> sonicFoam/constant/polyMesh/
    <span class="token punctuation">-</span> <span class="token punctuation">...</span>
  <span class="token key atrule">params</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> plotMesh
    <span class="token punctuation">-</span> scalarTransportFoam
  <span class="token key atrule">outs</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token punctuation">...</span></code></pre></div>
<h2 id="run-a-new-simulation" style="position:relative;">Run a new simulation<a href="#run-a-new-simulation" aria-label="run a new simulation permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>After the DVC pipeline is set up, you may run a new simulation experiment with a
command:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span></span></code></pre></div>
<p>To run a new simulation with updated parameters, you can manually change the
parameter value in the <code>params.yaml</code> file and run <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a>, or you can
<a href="https://dvc.org/doc/command-reference/exp/run#example-modify-parameters-on-the-fly" target="_blank" rel="nofollow noopener noreferrer">modify parameters on-the-fly</a>.
For example, let’s change the length of our simulation:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token string">'configureSim.controlDict.params.endTime=4'</span></span></code></pre></div>
<p>It is also possible to queue and run multiple simulations in parallel.</p>
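<p>As a rough sketch of what queuing could look like, the helper below builds one queued <code>dvc exp run</code> command per parameter value, followed by a command that starts the queue workers. It assumes a DVC version that provides <code>dvc exp run --queue</code> and <code>dvc queue start --jobs</code>; the helper name <code>queue_commands</code> and the chosen <code>endTime</code> values are hypothetical, and the commands are only printed here, not executed.</p>

```python
import shlex


def queue_commands(param: str, values: list, jobs: int) -> list:
    """Build DVC commands that queue one experiment per value, then start workers."""
    commands = [
        ["dvc", "exp", "run", "--queue", "-S", f"{param}={value}"]
        for value in values
    ]
    # Assumption: `dvc queue start` is available in the installed DVC version.
    commands.append(["dvc", "queue", "start", "--jobs", str(jobs)])
    return commands


commands = queue_commands("configureSim.controlDict.params.endTime", [2, 3, 4], jobs=2)
for command in commands:
    # Pass each list to subprocess.run(command, check=True) to actually execute it.
    print(shlex.join(command))
```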
<p>In the next post, we will show how to visualize and compare simulation data with
CML and Iterative Studio.</p>
<h1 id="versioning-and-sharing-simulation-data-with-dvc" style="position:relative;">Versioning and sharing simulation data with DVC<a href="#versioning-and-sharing-simulation-data-with-dvc" aria-label="versioning and sharing simulation data with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Effective data management is essential for successful OpenFOAM simulations.
Proper data management can help you organize and track the data and code
associated with your simulations, and make it easier to reproduce simulation
results.</p>
<p>There are several challenges that users of OpenFOAM may encounter in managing
the data associated with their simulations:</p>
<ol>
<li>
<p><strong>Large data volumes</strong>: OpenFOAM simulations can generate large amounts of
data, particularly for complex or high-resolution simulations. This can make
it difficult to store, transfer, and analyze the data effectively.</p>
</li>
<li>
<p><strong>Data version control</strong>: It is important for users to be able to track
changes to the input files and simulation results over time and to be able to
reproduce past simulations. This can be challenging without a version control
system or other means of tracking changes.</p>
</li>
<li>
<p><strong>Data transfer</strong>: Users may need to transfer large amounts of data between
different systems or devices, such as between their personal computers and a
high-performance computing cluster. This can be challenging due to the size
of the data and the potential for data transfer bottlenecks.</p>
</li>
<li>
<p><strong>Collaboration</strong>: Users may want to share simulation results with colleagues
or collaborate on simulations. This can be done by sharing the simulation
input files and results, as well as using tools such as online collaborative
platforms or version control systems.</p>
</li>
</ol>
<p>Luckily, DVC can help with all of them. Let’s review the core features of DVC
that we used in the demo project.
<a href="https://dvc.org/doc/use-cases/versioning-data-and-models" target="_blank" rel="nofollow noopener noreferrer">Data versioning</a> is a
core feature of DVC that captures the versions of simulation data in Git
commits, while storing the data itself on-premises or in cloud storage. Moreover, with DVC
pipelines, all outputs specified as <code>outs</code>, <code>plots</code>, or <code>metrics</code> in the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>
configuration are automatically added to DVC version control! Other files
generated by the stages are added to DVC via the <code>dvc_outs_handler.py</code>
script. The next step is to set up DVC remote storage and upload these files
there.</p>
<p>DVC helps to store large volumes of data in on-premises or cloud storage (e.g.
SSH, S3, HDFS,
<a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">etc.</a>).
The demo project uses AWS S3 as remote storage. For more details on remote
storage configuration, see
<a href="https://dvc.org/doc/command-reference/remote#example-customize-an-additional-s3-remote" target="_blank" rel="nofollow noopener noreferrer">Example: Customize an additional S3 remote</a>.</p>
<p>You may point the <code>s3remote</code> remote at your own AWS S3 bucket using the following
command:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> s3remote url s3://<span class="token operator"><</span>bucket<span class="token operator">></span>/<span class="token operator"><</span>path<span class="token operator">></span></span></code></pre></div>
<p>After the remote storage is set up, you need a single additional command to
transfer your results to the storage:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp push</span></span></code></pre></div>
<p>DVC takes care of pushing to and pulling from both Git and DVC remotes
when working with experiments, so collaborating with
colleagues is simple. Your colleagues can access your latest simulation results
with the <a href="https://dvc.org/doc/command-reference/exp/pull"><code>dvc exp pull</code></a> command (after updating their repository with <code>git pull</code>):</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp pull</span></span></code></pre></div>
<h1 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>This post details how Iterative tools help in physical and computational
simulations. The demo shows how to set up DVC for simulation experiments and
data management.</p>
<p>Overall, DVC can help OpenFOAM users to:</p>
<ol>
<li>
<p>Reduce the complexity of simulation pipelines and automate tasks such as
running simulations, post-processing results, and generating reports.</p>
</li>
<li>
<p>Manage and track the data and code associated with your OpenFOAM simulations,
and make it easier to reproduce simulation results.</p>
</li>
<li>
<p>Manage simulation experiments with YAML config files.</p>
</li>
<li>
<p>Store and share simulation data in the cloud using DVC and AWS S3.</p>
</li>
<li>
<p>Easily collaborate with your colleagues around simulation results, share and
reuse data.</p>
</li>
</ol>
<p>In the next post, we will discuss how to utilize cloud computing resources and
visualize and compare simulation data with CML and Iterative Studio.</p>https://dvc.org/blog/automate-your-ml-pipeline-combining-airflow-dvc-and-cml-for-a-seamless-batch-scoring-experiencehttps://dvc.org/blog/automate-your-ml-pipeline-combining-airflow-dvc-and-cml-for-a-seamless-batch-scoring-experienceWed, 22 Mar 2023 00:00:00 GMT<p>Companies in Banking, Telecom, Retail, and other industries process enormous
volumes of data to generate insights and gain value.
<a href="https://www.datarobot.com/wiki/scoring/" target="_blank" rel="nofollow noopener noreferrer">Batch scoring</a> is a common way to
operate machine learning applications for such companies. It helps to run ML
training and inference (scoring) jobs that operate with large amounts of data.
This post covers topics around the design, tools, and implementation of ML
applications for batch scoring scenarios with Airflow.</p>
<h3 id="what-is-batch-scoring" style="position:relative;">What is batch scoring?<a href="#what-is-batch-scoring" aria-label="what is batch scoring permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In machine learning, scoring is the process of applying a trained model to a new
dataset in an attempt to get practical predictions. Batch scoring is the way to
score (get predictions) for large datasets that are collected over some period
of time before being passed to the model. It is the most effective scoring
pattern when the model’s decisions don’t have to be implemented immediately. For
example, a CRM Department in Retail Banking may apply ML models to a batch of
active customers to determine which are most likely to buy a new credit product
next month. Other application examples:</p>
<ul>
<li>
<p><strong>Marketing Communication Optimization:</strong> effectively identifying customers
who are looking for new financial products and services, and then optimizing
marketing communication, is a perfect application for AI. This use case
includes not only identifying customers with a propensity to buy new products,
but also customers at risk of churning.</p>
</li>
<li>
<p><strong>Pricing Optimization:</strong> personalization of banking services requires
monitoring the marketplace dynamically to provide competitive prices for
existing and new customers.</p>
</li>
<li>
<p><strong>Next Best Action (NBA):</strong> this is a promising customer-centric approach to
optimize multiple different actions that could be taken for a specific
customer through multiple communication channels.</p>
</li>
</ul>
<h3 id="goals-for-this-post" style="position:relative;">Goals for this post<a href="#goals-for-this-post" aria-label="goals for this post permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This post shares an approach to solve 3 tasks in batch scoring applications:</p>
<ul>
<li>
<p>Build an ML pipeline to train a model.</p>
</li>
<li>
<p>Set up a <code>train</code> CI job to run model training at scale.</p>
</li>
<li>
<p>Set up a <code>deploy</code> CI job to deliver the inference (scoring) pipeline to an
Airflow cluster.</p>
</li>
</ul>
<h3 id="how-to-reproduce" style="position:relative;">How to reproduce<a href="#how-to-reproduce" aria-label="how to reproduce permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Code examples are stored in two repositories:</p>
<ul>
<li>
<p><a href="https://gitlab.com/iterative.ai/cse_public/home_credit_default" target="_blank" rel="nofollow noopener noreferrer">home_credit_default</a>
contains an end-to-end solution for a batch scoring application with Airflow</p>
</li>
<li>
<p><a href="https://gitlab.com/iterative.ai/cse_public/airflow-cluster" target="_blank" rel="nofollow noopener noreferrer">airflow-cluster</a>
contains configuration for Airflow and other services</p>
</li>
</ul>
<p>Fork the
<a href="https://gitlab.com/iterative.ai/cse_public/home_credit_default" target="_blank" rel="nofollow noopener noreferrer">home_credit_default</a>
repository if you'd like to replicate our steps and deploy your own
batch-scoring application with Airflow and DVC. Keep in mind that you'll need
to set up and configure the following:</p>
<ul>
<li>
<p>GitLab account and
<a href="https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html" target="_blank" rel="nofollow noopener noreferrer">Personal Access Token</a>.</p>
</li>
<li>
<p><a href="https://pipenv.pypa.io/en/latest/" target="_blank" rel="nofollow noopener noreferrer"><code>pip</code></a> and Docker installed locally</p>
</li>
</ul>
<p>The repository also contains code for Airflow DAGs, which can be found in the
<code>dags/</code> directory. A separate
<a href="https://gitlab.com/iterative.ai/cse_public/airflow-cluster" target="_blank" rel="nofollow noopener noreferrer">airflow-cluster</a>
repository is used to set up and run the Airflow cluster.</p>
<h2 id="design-ml-pipelines-with-dvc" style="position:relative;">Design ML pipelines with DVC<a href="#design-ml-pipelines-with-dvc" aria-label="design ml pipelines with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Machine Learning experiment pipelines for batch-scoring applications typically
involve the following steps:</p>
<ol>
<li>
<p><strong>Data preparation:</strong> The first step is to clean, pre-process, and transform
the data into a format that can be used for training machine learning models.</p>
</li>
<li>
<p><strong>Feature engineering:</strong> In this step, relevant features are extracted or
created from the data and transformed into a format that can be used for
training machine learning models.</p>
</li>
<li>
<p><strong>Model selection and training:</strong> Next, multiple machine learning models are
selected and trained using the prepared data.</p>
</li>
<li>
<p><strong>Model evaluation:</strong> The trained models are then evaluated to determine
their accuracy and performance on new data.</p>
</li>
</ol>
<p>By following these steps, the pipeline provides a systematic approach to
experimenting with different machine learning models, including feature
engineering, and selecting the best one for deployment.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/082baaa38ed94173d19d9a198cbbbda1/39600/diagram.png" alt="DVC Pipeline Design" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Machine
Learning experiment pipelines for batch scoring applications</em></p>
<p><a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a> can help automate this kind of
ML pipeline. For the purpose of this tutorial, the DVC pipeline consists of
five stages (see <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> in
<a href="https://gitlab.com/iterative.ai/cse_public/home_credit_default" target="_blank" rel="nofollow noopener noreferrer">the example repo</a>):</p>
<ul>
<li>
<p>Load Data (<code>load_data</code>)</p>
</li>
<li>
<p>Calculate features for <code>bureau.csv</code> data (<code>extract_features_bureau</code>)</p>
</li>
<li>
<p>Calculate features for <code>application.csv</code> data (<code>extract_features_application</code>)</p>
</li>
<li>
<p>Join features (<code>join_features</code>)</p>
</li>
<li>
<p>Train and save a model (<code>train</code>)</p>
</li>
</ul>
<p>The diagram below visualizes dependencies between stages of the DVC pipeline.
For such patterns, DVC automatically tracks changes and reduces the time needed
to run the pipeline. For example, if you iteratively improve only the code that
calculates features for the Bureau data, DVC will rerun only three stages:
<code>extract_features_bureau</code>, <code>join_features</code>, and <code>train</code>. DVC will skip running
<code>load_data</code> and <code>extract_features_application</code> because these stages did not
change, saving a substantial amount of time.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2e7610eb51bc6541433ff3bb99bc7c87/39600/dependency-diagram.png" alt="Dependency diagram" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Dependency Diagram</em></p>
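<p>The selective rerun described above can be illustrated with a small Python
sketch. It does not show how DVC is implemented. It only models the idea that a
changed stage invalidates itself and everything downstream of it. The edges in
<code>STAGE_DEPS</code> are an assumption inferred from the stage list, and
<code>stages_to_rerun</code> is a hypothetical helper name.</p>

```python
# Assumed stage dependencies (stage -> stages it depends on).
# The real definitions live in dvc.yaml of the example repo.
STAGE_DEPS = {
    "load_data": [],
    "extract_features_bureau": ["load_data"],
    "extract_features_application": ["load_data"],
    "join_features": ["extract_features_bureau", "extract_features_application"],
    "train": ["join_features"],
}


def stages_to_rerun(changed, deps=STAGE_DEPS):
    """Return the changed stage plus every stage downstream of it, in pipeline order."""
    dirty = {changed}
    updated = True
    # Keep propagating until no new downstream stage is marked dirty.
    while updated:
        updated = False
        for stage, parents in deps.items():
            if stage not in dirty and any(parent in dirty for parent in parents):
                dirty.add(stage)
                updated = True
    return [stage for stage in deps if stage in dirty]


print(stages_to_rerun("extract_features_bureau"))
# ['extract_features_bureau', 'join_features', 'train']
```

<p>Under these assumed edges, a change to the Bureau feature code leaves
<code>load_data</code> and <code>extract_features_application</code> untouched, which matches the
behavior described in the text.</p>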
<p>After we prepare the configuration for the ML pipeline, DVC runs a new
model training experiment with a single command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span></span></code></pre></div>
<p>Or, if you want to override a parameter from the <code>params.yaml</code> file and give
the experiment a specific name, you can run:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-n</span> <span class="token operator"><</span>NAME<span class="token operator">></span> <span class="token punctuation">[</span>--set-param <span class="token operator"><</span>param_name<span class="token operator">>=</span><span class="token operator"><</span>param_value<span class="token operator">></span><span class="token punctuation">]</span></span></code></pre></div>
<h2 id="train-model-at-scale-with-studio-and-cml" style="position:relative;">Train model at scale with Studio and CML<a href="#train-model-at-scale-with-studio-and-cml" aria-label="train model at scale with studio and cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In a common scenario, batch-scoring applications require a large amount of data
stored in remote storage. Data Scientists run ML experiments on a local (dev)
machine (e.g. laptop) using a sample of the data. Once a model and
hyperparameter configuration are chosen, an additional training run on the full
dataset is required. Sometimes, the final model training runs on a different,
high-performance machine. Results of the ML experiments should be stored and
remain accessible for later analysis, follow-up experiments, and any team
members who need to review them.</p>
<h3 id="continuous-integration-ci-workflow" style="position:relative;">Continuous Integration (CI) workflow<a href="#continuous-integration-ci-workflow" aria-label="continuous integration ci workflow permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Designing a CI (Continuous Integration) job to run model training at scale
involves the following steps:</p>
<ol>
<li>
<p><strong>Environment setup:</strong> Create a reproducible environment for model training
by using virtual machines or containers. GitLab and <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a>
help us prepare and provision an environment for the training job.</p>
</li>
<li>
<p><strong>Automated build:</strong> Set up an automated build process that triggers a build
every time code is committed to the repository. We use GitLab CI
configuration to automate building a Docker image and run tests for the code.</p>
</li>
<li>
<p><strong>Parallel processing:</strong> Utilize parallel processing to run multiple model
training jobs in parallel. This reduces the time required to train the model
and can be accomplished using tools like Dask or Ray. In this example, we
don’t use these tools.</p>
</li>
<li>
<p><strong>Training:</strong> Make sure that the model training pipeline can scale to handle
large amounts of data and processing power. As a result of the training job,
a new model is saved. CML can help set up and use cloud computing
resources or high-performance computing systems.</p>
</li>
</ol>
<p>GitLab's Continuous Integration (CI) pipeline configuration for this post
example is stored in the
<a href="https://gitlab.com/iterative.ai/cse_public/home_credit_default/-/blob/main/.gitlab-ci.yml" target="_blank" rel="nofollow noopener noreferrer"><code>.gitlab-ci.yml</code> file</a>.
It specifies different stages of the pipeline including building an image,
testing the code, training a model, and deploying Airflow DAGs. The image below
provides a graphical representation of this pipeline.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/148f86f382ad2e8f6003b216ba489a7d/39600/gitlab-airflow.png" alt="GitLab Continuous Integration Pipeline Configuration" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>GitLab Continuous Integration Pipeline Configuration with Airflow Cluster</em></p>
<ul>
<li>
<p>The GitLab repository triggers the CI pipeline as soon as new code or
parameter updates are committed to the repository. This runs the <code>build</code>, <code>test</code>,
and <code>train</code> CI jobs. The <code>train</code> job trains a model on the full dataset
on a remote machine (or in the cloud), generates model training reports, and
creates a Merge Request in the GitLab repo.</p>
</li>
<li>
<p>Merging (accepting a pull/merge request) the experiment results into the
<code>main</code> branch triggers the <code>deploy</code> job.</p>
</li>
<li>
<p>Every month, Airflow runs <code>scoring</code> jobs to generate predictions (scores) for
all clients on new data. Generated predictions are stored in the prediction
database or files.</p>
</li>
</ul>
<h3 id="setup-train-job-with-gitlab-ci-and-cml" style="position:relative;">Setup <code>train</code> job with GitLab CI and CML<a href="#setup-train-job-with-gitlab-ci-and-cml" aria-label="setup train job with gitlab ci and cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In this post’s example, the training job is triggered when a new Merge
Request into the <code>main</code> branch is created, or when the Git commit message (for a
commit to any branch) contains the <code>[exp]</code> tag. This configuration allows us to
achieve two goals:</p>
<ol>
<li>
<p>We can decide whether a code (or params) change needs to trigger a new
experiment. A minor update (e.g. editing the documentation in the README)
does not need to run one.</p>
</li>
<li>
<p>We ensure that every merge into the <code>main</code> branch is linked to the latest
model.</p>
</li>
</ol>
<p>An example of the <code>train</code> job configuration is presented below. There are three
main steps in the <code>script</code> there:</p>
<ol>
<li>
<p>Run a new experiment on a full-scale dataset with <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a></p>
</li>
<li>
<p>Prepare the <code>report.md</code> file with metrics and plots,</p>
</li>
<li>
<p>Publish the <code>report.md</code> content to the Merge Request (Pull Request) message
in GitLab (<a href="https://cml.dev/doc/ref/publish" target="_blank" rel="nofollow noopener noreferrer">using CML</a>).</p>
</li>
</ol>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train</span><span class="token punctuation">:</span>
<span class="token punctuation">...</span>
<span class="token key atrule">rules</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> <span class="token key atrule">if</span><span class="token punctuation">:</span> $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "main" <span class="token punctuation">|</span><span class="token punctuation">|</span> $CI_COMMIT_MESSAGE =~ /\<span class="token punctuation">[</span>exp\<span class="token punctuation">]</span>/
<span class="token key atrule">image</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span>PROJECT_IMAGE<span class="token punctuation">}</span>
<span class="token key atrule">script</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> <span class="token punctuation">...</span>
<span class="token punctuation">-</span> dvc exp run <span class="token punctuation">-</span><span class="token punctuation">-</span>pull <span class="token punctuation">-</span>S load_data.sample_size=1.0
<span class="token punctuation">-</span> <span class="token punctuation">|</span><span class="token scalar string">
echo "# Metrics" >> report.md
echo "## Experiment metrics" >> report.md
dvc metrics show --show-md >> report.md
...
echo "## Plot train lift curve " >> report.md
echo '' >> report.md</span>
<span class="token punctuation">-</span> cml pr create . <span class="token punctuation">-</span><span class="token punctuation">-</span>md <span class="token punctuation">></span><span class="token punctuation">></span> report.md
<span class="token punctuation">-</span> cml comment create <span class="token punctuation">-</span><span class="token punctuation">-</span>target=commit report.md</code></pre></div>
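<p>The <code>rules</code> condition above can be restated as a small Python predicate, which
is handy for reasoning about which commits start a training run. This is only an
illustration of the logic, not something GitLab executes. The function name
<code>should_run_train</code> and its arguments are hypothetical stand-ins for the
<code>$CI_MERGE_REQUEST_TARGET_BRANCH_NAME</code> and <code>$CI_COMMIT_MESSAGE</code> variables.</p>

```python
import re


def should_run_train(target_branch, commit_message):
    """Return True when the `train` job would be triggered.

    target_branch: target branch of the Merge Request, or None if the
                   pipeline does not belong to a Merge Request.
    commit_message: full Git commit message.
    """
    targets_main = target_branch == "main"
    # Same pattern as the CI rule: a literal "[exp]" anywhere in the message.
    has_exp_tag = re.search(r"\[exp\]", commit_message or "") is not None
    return targets_main or has_exp_tag


print(should_run_train("main", "Update README"))           # True
print(should_run_train(None, "[exp] try deeper trees"))    # True
print(should_run_train(None, "Fix typo in docs"))          # False
```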
<h3 id="run-ml-experiments-with-iterative-studio" style="position:relative;">Run ML experiments with Iterative Studio<a href="#run-ml-experiments-with-iterative-studio" aria-label="run ml experiments with iterative studio permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The proposed CI pipeline makes it possible to implement a development process
that:</p>
<ul>
<li>
<p>Automates the launch of experiments with training models when the code
changes.</p>
</li>
<li>
<p>Links the change in versions of the code and artifacts (models, data).</p>
</li>
<li>
<p>Makes the development more straightforward and manageable.</p>
</li>
</ul>
<p>Moreover, it enables <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a> to run new
experiments from the UI.</p>
<p>The experiment workflow in Iterative Studio is simple (see the diagram
below).</p>
<ul>
<li>
<p>In the first step (1), we update the experiment configuration and trigger
running a new experiment. This functionality is available in the standard
package of the Iterative Studio.</p>
</li>
<li>
<p>Then (2) the configured GitLab CI pipeline launches the experiment job.</p>
</li>
<li>
<p>After the job completes, CML publishes the experiment report as a comment on
the GitLab commit (3). Iterative Studio is constantly monitoring the project
repository for updates.</p>
</li>
<li>
<p>As soon as the repo changes, Iterative Studio updates tracking files in the UI
(4). Data Scientists can compare experiment metrics and plots.</p>
</li>
<li>
<p>Also, DVC stores the updated versions of a model and artifacts to DVC Storage
(5).</p>
</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d823204577287d5aaa66392eac57679f/39600/trigger-experiment.png" alt="GitLab Continuous Integration Pipeline Configuration" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>GitLab Continuous Integration Pipeline Configuration with Airflow Cluster</em></p>
<p>After the experiment completes, Iterative Studio helps to visualize parameters,
metrics, and plots. Users may compare experiments, run new ones, and share them
with colleagues.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1ec8814c2de0e8238ab27cc1d62ad0da/39600/confusion-matrix.png" alt="Confusion Matrices" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Visualize Parameters, Metrics and Plots in Iterative Studio</em></p>
<h2 id="deploy-scoring-pipeline" style="position:relative;">Deploy scoring pipeline<a href="#deploy-scoring-pipeline" aria-label="deploy scoring pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>A batch scoring inference pipeline in machine learning is a series of steps that
are executed in a specific order to process a large amount of data and generate
predictions based on a pre-trained model. It typically includes the following
steps:</p>
<ol>
<li>
<p><strong>Input data preparation:</strong> This step involves cleaning, transforming, and
preprocessing the input data so that it can be fed into the model for
prediction. Feature engineering can be a part of this step.</p>
</li>
<li>
<p><strong>Model loading:</strong> The pre-trained model is loaded into memory, usually from
storage or a database, so that it can be used for predictions.</p>
</li>
<li>
<p><strong>Inference:</strong> The input data is passed through the model to generate
predictions. This is done in a batch-wise manner, where a large amount of
data is processed in one go to reduce the overhead of repetitively loading
the model.</p>
</li>
<li>
<p><strong>Post-processing:</strong> This step involves any additional processing of the
prediction results, such as normalization, thresholding, or aggregation,
before they are written to an output file or database.</p>
</li>
<li>
<p><strong>Saving predictions:</strong> Finally, the prediction results are saved to a file
or database for further analysis or use. This can be done in various formats,
such as CSV, JSON, or binary.</p>
</li>
</ol>
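<p>The five steps above can be sketched end to end in plain Python. This is a toy
illustration, not the scoring code from the example repository: the column names,
the stand-in "model" (a simple debt-to-income ratio), and all function names are
assumptions made up for this sketch.</p>

```python
import csv
import io


def prepare(rows):
    """Input data preparation: cast raw strings to numbers, drop incomplete rows."""
    return [
        {"client_id": r["client_id"], "income": float(r["income"]), "debt": float(r["debt"])}
        for r in rows
        if r["income"] and r["debt"]
    ]


def load_model():
    """Model loading: stand-in for reading a trained model from storage."""
    return lambda row: min(1.0, row["debt"] / max(row["income"], 1.0))


def score_in_batches(model, rows, batch_size=2):
    """Inference: the model is loaded once and reused for every batch."""
    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:
            yield row["client_id"], model(row)


def postprocess(scores, threshold=0.5):
    """Post-processing: round the score and apply a decision threshold."""
    return [
        {"client_id": cid, "score": round(score, 3), "flag": int(score >= threshold)}
        for cid, score in scores
    ]


def save(predictions, out):
    """Saving predictions: write the results as CSV to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=["client_id", "score", "flag"])
    writer.writeheader()
    writer.writerows(predictions)


raw = [
    {"client_id": "a", "income": "1000", "debt": "200"},
    {"client_id": "b", "income": "500", "debt": "400"},
    {"client_id": "c", "income": "", "debt": "50"},  # incomplete, dropped in prepare()
]
buffer = io.StringIO()
predictions = postprocess(score_in_batches(load_model(), prepare(raw)))
save(predictions, buffer)
print(buffer.getvalue())
```

<p>Keeping each step as a separate function mirrors how an orchestrator can later
run, retry, or monitor the steps independently.</p>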
<p>The pipeline can be implemented using a variety of tools and technologies such
as Apache Airflow, Apache Spark, or even custom scripts. The key aspect of a
scoring pipeline is that it is automated, efficient, and scalable, making it
possible to score large volumes of data in a timely and consistent manner.</p>
<p>Because of the large number of pre- and post-processing tasks, including
checking for data source updates, a typical scenario requires deploying a
scoring pipeline, not just a model.</p>
<h3 id="batch-scoring-inference-pipeline-with-airflow" style="position:relative;">Batch scoring inference pipeline with Airflow<a href="#batch-scoring-inference-pipeline-with-airflow" aria-label="batch scoring inference pipeline with airflow permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In this example, we implement an inference pipeline using
<a href="https://airflow.apache.org/" target="_blank" rel="nofollow noopener noreferrer">Apache Airflow</a>. Airflow helps to schedule and run
pipelines (DAG) for various data engineering and machine learning purposes. DAG
is a Directed Acyclic Graph that describes an order of computational Tasks
(jobs) to run. The basics of the Airflow pipeline definition can be found
<a href="https://airflow.apache.org/docs/apache-airflow/stable/start.html" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<p>We store Airflow DAGs in the <code>dags/</code> directory in the same repository as our ML
pipeline.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d3ac81fde4d1b093c3cb70d812dae65a/39600/dags.png" alt="DAG" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DAGs Directory</em></p>
<p>Let’s go a bit deeper into the Airflow DAG <code>dags/scoring.py</code> to find out how DVC
is used there! This DAG is designed to be run every 5th day of the month to
calculate predictions and save them into a .csv file.</p>
<p>The DAG performs the following steps:</p>
<ol>
<li>
<p>It creates a temporary directory for the local repository
(<strong><code>create_tmp_dir</code></strong> task).</p>
</li>
<li>
<p>It clones the repository specified in the <strong><code>project_args</code></strong> argument
(<strong><code>clone</code></strong> task).</p>
</li>
<li>
<p>It runs the scoring script from the cloned repository and saves predictions
(<strong><code>run_scoring</code></strong> task).</p>
</li>
<li>
<p>Finally, it removes the temporary repository directory (<strong><code>clean</code></strong> task).</p>
</li>
</ol>
<p>For the purposes of this post, we are most interested in the <code>run_scoring</code> task!
The <code>run_scoring</code> task is a BashOperator in Apache Airflow. It performs the
following actions:</p>
<ol>
<li>
<p>Runs the <a href="https://dvc.org/doc/command-reference/fetch"><code>dvc fetch</code></a> command to fetch the latest version of the artifacts and
model to be used for inference.</p>
</li>
<li>
<p>Runs the <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a> command to check out the latest version of the data.</p>
</li>
<li>
<p>Runs a Python script located at <code>src/stages/scoring.py</code> with the following
command line arguments:</p>
<ul>
<li>
<p><code>--config</code> specifies the path to the parameters file in YAML format,</p>
</li>
<li>
<p><code>--scoring-date</code> specifies the date for which the scoring should be
performed,</p>
</li>
<li>
<p><code>--storage-path</code> specifies the location of the storage.</p>
</li>
</ul>
</li>
</ol>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">run_scoring <span class="token operator">=</span> BashOperator<span class="token punctuation">(</span>
task_id<span class="token operator">=</span><span class="token string">'run_scoring'</span><span class="token punctuation">,</span>
bash_command<span class="token operator">=</span><span class="token string-interpolation"><span class="token string">f''</span></span>'
cd <span class="token punctuation">{</span>project_args<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'dag_run_dir'</span><span class="token punctuation">)</span><span class="token punctuation">}</span> <span class="token operator">&</span><span class="token operator">&</span> \
export PYTHONPATH<span class="token operator">=</span><span class="token punctuation">.</span> <span class="token operator">&</span><span class="token operator">&</span> \
dvc fetch <span class="token operator">&</span><span class="token operator">&</span> \
dvc checkout <span class="token operator">&</span><span class="token operator">&</span> \
python src<span class="token operator">/</span>stages<span class="token operator">/</span>scoring<span class="token punctuation">.</span>py \
<span class="token operator">-</span><span class="token operator">-</span>config<span class="token operator">=</span>params<span class="token punctuation">.</span>yaml \
<span class="token operator">-</span><span class="token operator">-</span>scoring<span class="token operator">-</span>date<span class="token operator">=</span><span class="token punctuation">{</span><span class="token punctuation">{</span><span class="token punctuation">{</span><span class="token punctuation">{</span> first_day_of_month<span class="token punctuation">(</span>ds<span class="token punctuation">)</span> <span class="token punctuation">}</span><span class="token punctuation">}</span><span class="token punctuation">}</span><span class="token punctuation">}</span> \
<span class="token operator">-</span><span class="token operator">-</span>storage<span class="token operator">-</span>path<span class="token operator">=</span><span class="token punctuation">{</span>project_args<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'storage_path'</span><span class="token punctuation">)</span><span class="token punctuation">}</span>
''',
)</code></pre></div>
<p>In short, this example shows how the deployed Airflow DAG uses DVC to fetch
the latest model for inference. This is awesome!</p>
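<p>The <code>--scoring-date</code> argument in the command above is rendered from
<code>first_day_of_month(ds)</code>, where <code>ds</code> is Airflow's date string in
<code>YYYY-MM-DD</code> format. The helper itself is not shown in this post, so the version
below is only a guess at what such a function could look like, assuming it
returns a string in the same format.</p>

```python
from datetime import datetime


def first_day_of_month(ds):
    """Map a YYYY-MM-DD date string to the first day of the same month.

    Hypothetical re-implementation; the helper used by the DAG in the
    example repository may differ.
    """
    return datetime.strptime(ds, "%Y-%m-%d").replace(day=1).strftime("%Y-%m-%d")


print(first_day_of_month("2023-03-05"))  # 2023-03-01
```

<p>With a helper like this, a run scheduled on the 5th of the month would score
data as of the 1st, regardless of the exact day the DAG fires.</p>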
<h3 id="setup-ci-job-deploy" style="position:relative;">Setup CI job <code>deploy</code><a href="#setup-ci-job-deploy" aria-label="setup ci job deploy permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<aside>
💡 Merging (accepting a Merge Request with) experiment results into the <code>main</code> branch triggers the
<code>deploy</code> job.
</aside>
<p>There are various strategies for delivering the <code>scoring</code> DAG to the Airflow
cluster. In this example, the GitLab CI pipeline pushes (copies) DAG files from
the repo to the Airflow home directory (specified by <code>${AIRFLOW_HOME}</code>) and
activates them.</p>
<p>The <code>deploy_dags</code> CI job configuration looks like this:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">deploy_dags</span><span class="token punctuation">:</span>
<span class="token key atrule">stage</span><span class="token punctuation">:</span> deploy
<span class="token punctuation">...</span>
<span class="token key atrule">script</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> <span class="token punctuation">|</span><span class="token scalar string">
export DAGS_FOLDER=${AIRFLOW_HOME}/dags/${PROJECT_FOLDER}</span>
<span class="token comment"># Create ${DAGS_FOLDER}</span>
rm <span class="token punctuation">-</span>rf $<span class="token punctuation">{</span>DAGS_FOLDER<span class="token punctuation">}</span> <span class="token important">&&</span> mkdir <span class="token punctuation">-</span>p $<span class="token punctuation">{</span>DAGS_FOLDER<span class="token punctuation">}</span>
<span class="token comment"># Copy content of folder ./dags to ${DAGS_FOLDER} directory</span>
cp <span class="token punctuation">-</span>r dags/* $<span class="token punctuation">{</span>DAGS_FOLDER<span class="token punctuation">}</span>
echo "Airflow DAGs copied to $<span class="token punctuation">{</span>DAGS_FOLDER<span class="token punctuation">}</span>"</code></pre></div>
<p>This simple example is for demonstration purposes, but it works as a
proof-of-concept for DVC-Airflow-Studio integration for batch scoring
applications.</p>
<h2 id="results" style="position:relative;">Results<a href="#results" aria-label="results permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The proposed approach demonstrates how DVC, CML, and Iterative Studio may help
in batch scoring applications at the experimentation and production phases.
Solutions discussed in this post may benefit similar use cases in a few ways:</p>
<ul>
<li>
<p>Help with system design and tools integration.</p>
</li>
<li>
<p>Automate ML experiments.</p>
</li>
<li>
<p>Increase the speed of the Proof-of-Concept (POC) and Operationalization (MLOps)
stages.</p>
</li>
<li>
<p>Save time and money on similar projects.</p>
</li>
</ul>
<p>Specifically, DVC and Iterative Studio can benefit batch scoring applications
by:</p>
<ul>
<li>
<p>Enabling regulatory compliance and auditability. Iterative Studio offers a
robust approach to tracking data usage and to storing and versioning the data
and configurations used for model training and prediction. Models are developed
in an environment that links code, data, and configs for reproducible
experiments, which supports auditability in the event of a compliance
audit.</p>
</li>
<li>
<p>Running machine learning experiments, with or without coding. Iterative Studio
offers a user-friendly UI for analysts and data scientists to create a new
experiment, change the configuration, and run it with a single click.</p>
</li>
<li>
<p>Accessing versioned models during the CI/CD process and using them to run a
scoring job with Airflow.</p>
</li>
</ul>
<hr>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/cloud-versioninghttps://dvc.org/blog/cloud-versioningWed, 22 Feb 2023 00:00:00 GMT<p>If you use cloud storage regularly, you have probably seen it become a mess like
this S3 bucket:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/8f24ac8a282698234abe9b2368846be2/39600/no_versions.png" alt="no versions" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Luckily, major cloud storage providers can version files automatically. Still,
even with versioning enabled, you might find you end up with a mess. More
importantly, you forget which version is which.</p>
<p>That's because versioning happens at the file level. There's no way to version a
composite dataset or entire machine learning project. This is where DVC can
supplement cloud versioning and finally let you clean up your cloud storage. DVC
records the versions of all the files in your dataset, so you have a complete
snapshot of each point in time. You can store this record in Git alongside the
rest of your project and use it to recover the data from that time, giving you
the freedom to keep adding new data in place without fear of losing track of the
old data. DVC ensures reproducibility while keeping everything organized between
your Git repo and cloud storage, so you can focus on iterating on your machine
learning project.</p>
<admon type="info">
<p>If you already use DVC, you might be familiar with data versioning and want to
know what DVC cloud versioning means for you. Read the next section to get more
familiar with cloud versioning generally or skip directly to the section
<a href="#for-existing-dvc-users">for existing DVC users</a>.</p>
</admon>
<h1 id="how-cloud-versioning-works" style="position:relative;">How cloud versioning works<a href="#how-cloud-versioning-works" aria-label="how cloud versioning works permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>With versioning enabled, whenever you save a file to the cloud, it will get a
unique version ID. When you overwrite (or even delete) a file, the previous
version remains accessible by referencing its version ID.</p>
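<p>To make the behavior concrete, here is a minimal in-memory sketch of a versioned bucket. The <code>VersionedBucket</code> class and its <code>v1</code>, <code>v2</code> style IDs are hypothetical and only model the idea; real providers assign opaque version IDs and expose this through their own APIs.</p>

```python
import itertools


class VersionedBucket:
    """Toy in-memory model of a versioned bucket (not a real cloud API)."""

    def __init__(self):
        self._versions = {}  # key -> list of (version_id, data or None)
        self._counter = itertools.count(1)

    def put(self, key, data):
        """Store data under key and return the new version ID."""
        version_id = f"v{next(self._counter)}"
        self._versions.setdefault(key, []).append((version_id, data))
        return version_id

    def delete(self, key):
        """Record a delete marker; older versions stay retrievable."""
        return self.put(key, None)

    def get(self, key, version_id=None):
        """Return the current data, or the data stored under version_id."""
        history = self._versions[key]
        if version_id is None:
            data = history[-1][1]
            if data is None:
                raise KeyError(f"{key} is deleted")
            return data
        for vid, data in history:
            if vid == version_id:
                return data
        raise KeyError(f"{key} has no version {version_id}")


bucket = VersionedBucket()
first = bucket.put("model.pt", b"weights-1")
second = bucket.put("model.pt", b"weights-2")  # overwrite
bucket.delete("model.pt")                      # delete

# The overwritten version is still reachable through its ID.
old_weights = bucket.get("model.pt", version_id=first)
```

<p>Both the overwrite and the delete only add entries to the history, which is why earlier versions remain recoverable.</p>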
<p>Here's the same data from above organized with cloud versioning:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/90dc198bc1858e98b9b1aabc314958be/39600/show_versions.png" alt="show versions" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Overwritten and deleted
files may be recovered using their version IDs.</em></p>
<p>And here it is showing only the current versions:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7f12527c5ca6a17e2200dc6756d58678/39600/collapsed_versions.png" alt="collapsed versions" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Enabling versioning
can keep your cloud storage organized by collapsing file versions.</em></p>
<p>Now the model versions are all collapsed under one file name and ordered by
time, but what about the <code>predictions</code> folder? Let's assume this project trains
a neural machine translation model, and each file in <code>predictions</code> is a
predicted translation of a sentence. Each model iteration generates a new set of
predictions. How can we reassemble the predictions from an earlier model
version?</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/427c01d0e46fa7a15d31f51f544d406a/39600/dir_versions.png" alt="dir versions" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>For a folder of many files,
keeping track of versions becomes unrealistic.</em></p>
<h1 id="how-dvc-works-with-cloud-versioning" style="position:relative;">How DVC works with cloud versioning<a href="#how-dvc-works-with-cloud-versioning" aria-label="how dvc works with cloud versioning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Cloud versioning falls short for tracking and syncing folders and projects, but
this is where DVC can help. DVC records the version IDs of all files in your
dataset or project. You keep this record in a Git repository so you can maintain
snapshots of your cloud-versioned data (the data itself gets stored on the
cloud, not in Git).</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9de9366bfc669d5157b3bd8e6ba4d152/39600/dir_versions_dvc.png" alt="dir versions dvc" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DVC connects multiple
version IDs across a folder or project.</em></p>
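<p>The idea of a snapshot can be sketched in a few lines of Python. This is a conceptual illustration, not DVC's implementation: the <code>snapshot</code> and <code>restore</code> helpers, the dictionary-based store, and the short version IDs are all assumptions made for the example.</p>

```python
def snapshot(current_versions, paths):
    """Record the current version ID of every path in a dataset."""
    return {path: current_versions[path] for path in sorted(paths)}


def restore(store, manifest):
    """Fetch each file at the exact version recorded in the manifest."""
    return {path: store[(path, vid)] for path, vid in manifest.items()}


# Hypothetical versioned store: (path, version_id) -> content.
store = {
    ("predictions/0.txt", "a1"): "hallo welt",
    ("predictions/1.txt", "b1"): "guten morgen",
}
current = {"predictions/0.txt": "a1", "predictions/1.txt": "b1"}

# Snapshot taken after the first model iteration (the record kept in Git).
manifest_v1 = snapshot(current, current.keys())

# A later iteration overwrites one prediction, which creates a new version.
store[("predictions/0.txt", "a2")] = "hallo, welt!"
current["predictions/0.txt"] = "a2"

# The old snapshot still reassembles the original folder contents.
predictions_v1 = restore(store, manifest_v1)
```

<p>The manifest is small and text-based, which is what makes it practical to keep in Git while the data itself stays in the cloud.</p>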
<admon type="tip">
<p>Before you start with DVC, ensure that your cloud storage is configured
correctly. Cloud versioning must be enabled at the bucket or storage account
level. See the <a href="#quickstart">Quickstart</a> below for instructions if versioning is not
already enabled. You also need write access to the cloud storage (more info on
how to configure your storage
<a href="https://dvc.org/doc/user-guide/data-management/remote-storage" target="_blank" rel="nofollow noopener noreferrer">here</a>).</p>
</admon>
<p>To start using cloud versioning in DVC, <a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">install</a>
DVC and set up a <code>version_aware</code> remote inside a Git repo. A remote is the cloud
storage location where you want to sync the data, and <code>version_aware</code> tells DVC
to use cloud versioning.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc init</span>
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">--default</span> myremote s3://cloud-versioned-bucket/path
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> myremote version_aware <span class="token boolean">true</span></span></code></pre></div>
<p>Use <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> to start tracking your model and predictions and <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> to
sync them to the cloud.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc add</span> model.pt predictions
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span>
</span>11 files pushed</code></pre></div>
<admon type="tip">
<p>If you want to start tracking changes to an existing cloud dataset instead of
starting from a local copy, see
<a href="https://dvc.org/doc/command-reference/import-url#example-tracking-cloud-version-ids" target="_blank" rel="nofollow noopener noreferrer">dvc import-url --version-aware</a>.</p>
</admon>
<p>DVC adds <code>model.pt.dvc</code> and <code>predictions.dvc</code> files with the version ID (and
other metadata) of each file.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">outs</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token key atrule">path</span><span class="token punctuation">:</span> predictions
    <span class="token key atrule">files</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">relpath</span><span class="token punctuation">:</span> 0.txt
        <span class="token key atrule">md5</span><span class="token punctuation">:</span> f163358b0b2b89281d6990e82495d6ca
        <span class="token key atrule">size</span><span class="token punctuation">:</span> <span class="token number">154</span>
        <span class="token key atrule">cloud</span><span class="token punctuation">:</span>
          <span class="token key atrule">myremote</span><span class="token punctuation">:</span>
            <span class="token key atrule">etag</span><span class="token punctuation">:</span> f163358b0b2b89281d6990e82495d6ca
            <span class="token key atrule">version_id</span><span class="token punctuation">:</span> UkLM3za5T8oH6.EeZCqOrFNBvUnrAlT7
      <span class="token punctuation">-</span> <span class="token key atrule">relpath</span><span class="token punctuation">:</span> 1.txt
        <span class="token key atrule">md5</span><span class="token punctuation">:</span> ec736fcb3b92886399f3577eac2163bb
        <span class="token key atrule">size</span><span class="token punctuation">:</span> <span class="token number">154</span>
        <span class="token key atrule">cloud</span><span class="token punctuation">:</span>
          <span class="token key atrule">myremote</span><span class="token punctuation">:</span>
            <span class="token key atrule">etag</span><span class="token punctuation">:</span> ec736fcb3b92886399f3577eac2163bb
            <span class="token key atrule">version_id</span><span class="token punctuation">:</span> fE4Fst2Z25sYEjaJo_0mXZzWDT6vQ4Uz</code></pre></div>
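<p>As a rough illustration of how this record can be used, the sketch below collects the version IDs from the metafile contents shown above, written out as an already-parsed Python dict. The <code>version_ids</code> helper is hypothetical, and the nesting (files under each output, cloud details under each file) is an assumption about the layout rather than a statement of DVC's internals.</p>

```python
def version_ids(metafile, remote):
    """Map each file path to the version ID recorded for the given remote."""
    result = {}
    for out in metafile["outs"]:
        for entry in out.get("files", []):
            path = f"{out['path']}/{entry['relpath']}"
            result[path] = entry["cloud"][remote]["version_id"]
    return result


# Contents of predictions.dvc, as they might look after YAML parsing.
predictions_dvc = {
    "outs": [
        {
            "path": "predictions",
            "files": [
                {
                    "relpath": "0.txt",
                    "md5": "f163358b0b2b89281d6990e82495d6ca",
                    "size": 154,
                    "cloud": {
                        "myremote": {
                            "etag": "f163358b0b2b89281d6990e82495d6ca",
                            "version_id": "UkLM3za5T8oH6.EeZCqOrFNBvUnrAlT7",
                        }
                    },
                },
                {
                    "relpath": "1.txt",
                    "md5": "ec736fcb3b92886399f3577eac2163bb",
                    "size": 154,
                    "cloud": {
                        "myremote": {
                            "etag": "ec736fcb3b92886399f3577eac2163bb",
                            "version_id": "fE4Fst2Z25sYEjaJo_0mXZzWDT6vQ4Uz",
                        }
                    },
                },
            ],
        }
    ]
}

ids = version_ids(predictions_dvc, "myremote")
```

<p>Each path maps to exactly one version ID per remote, which is all that is needed to fetch that precise version later.</p>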
<p>Next, track <code>model.pt.dvc</code> and <code>predictions.dvc</code> in Git.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> model.pt.dvc predictions.dvc .gitignore
</span>
<span class="token line"><span class="token input">$ </span><span class="token git">git commit</span> <span class="token parameter variable">-m</span> <span class="token string">"added and pushed model and predictions"</span></span></code></pre></div>
<admon type="tip">
<p>DVC will also make Git ignore <code>model.pt</code> and the <code>predictions</code> folder so that
Git only tracks the metadata. For more info on the mechanics of how DVC works,
see
<a href="https://dvc.org/doc/use-cases/versioning-data-and-models" target="_blank" rel="nofollow noopener noreferrer">Versioning Data and Models</a>.</p>
</admon>
<p>Now there is a versioned record of the model and predictions in Git commits, and
we can revert to any of them without having to manually track version IDs. If
someone else clones the Git repo, they can pull the exact versions pushed with
that commit, even if those have been overwritten in cloud storage.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git clone</span> git@github.com:iterative/myrepo
</span>
<span class="token line"><span class="token input">$ </span><span class="token command">cd</span> myrepo
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span>
</span>A predictions/
A model.pt
2 files added and 11 files fetched</code></pre></div>
<h1 id="for-existing-dvc-users" style="position:relative;">For existing DVC users<a href="#for-existing-dvc-users" aria-label="for existing dvc users permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>If you have versioning enabled on your cloud storage (or can enable it), you may
wish to start using <code>version_aware</code> remotes to simplify the structure of your
remote (or so you don't have to explain that structure to your colleagues). A
<code>version_aware</code> remote is similar to the remotes you already use, except easier
to read.</p>
<p>A traditional cache-like DVC remote looks like:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/5b81c5336415ea6c37f1e551e23eb3ea/39600/remote_cache.png" alt="remote cache" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>A cloud-versioned remote looks like:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/16c473880803d465f38e753ef4d97a06/39600/remote_cloud_versioned.png" alt="remote cloud versioned" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>The other difference is that version IDs get added to the
<a href="https://dvc.org/doc/user-guide/project-structure" target="_blank" rel="nofollow noopener noreferrer">DVC metafiles</a> during
<a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">outs</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token key atrule">path</span><span class="token punctuation">:</span> predictions
    <span class="token key atrule">files</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">relpath</span><span class="token punctuation">:</span> 0.txt
        <span class="token key atrule">md5</span><span class="token punctuation">:</span> f163358b0b2b89281d6990e82495d6ca
        <span class="token key atrule">size</span><span class="token punctuation">:</span> <span class="token number">154</span>
        <span class="token key atrule">cloud</span><span class="token punctuation">:</span>
          <span class="token key atrule">myremote</span><span class="token punctuation">:</span>
            <span class="token key atrule">etag</span><span class="token punctuation">:</span> f163358b0b2b89281d6990e82495d6ca
            <span class="token key atrule">version_id</span><span class="token punctuation">:</span> UkLM3za5T8oH6.EeZCqOrFNBvUnrAlT7
      <span class="token punctuation">-</span> <span class="token key atrule">relpath</span><span class="token punctuation">:</span> 1.txt
        <span class="token key atrule">md5</span><span class="token punctuation">:</span> ec736fcb3b92886399f3577eac2163bb
        <span class="token key atrule">size</span><span class="token punctuation">:</span> <span class="token number">154</span>
        <span class="token key atrule">cloud</span><span class="token punctuation">:</span>
          <span class="token key atrule">myremote</span><span class="token punctuation">:</span>
            <span class="token key atrule">etag</span><span class="token punctuation">:</span> ec736fcb3b92886399f3577eac2163bb
            <span class="token key atrule">version_id</span><span class="token punctuation">:</span> fE4Fst2Z25sYEjaJo_0mXZzWDT6vQ4Uz</code></pre></div>
<p>This means you need to be more careful about the order in which you <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>
and <code>git commit</code>. You should first <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> and then <code>git commit</code> since
pushing will modify the DVC metafiles. This might seem odd, but it means you
have a record in Git of what was pushed, so there is no more guessing whether
you remembered to push.</p>
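<p>If you script this workflow, the ordering can be enforced in code. The following is a minimal sketch: the <code>push_then_commit</code> helper and its injectable <code>run</code> parameter are assumptions made for illustration, while the commands it runs are the ordinary <code>dvc push</code>, <code>git add</code>, and <code>git commit</code>.</p>

```python
import subprocess


def push_then_commit(metafiles, message, run=subprocess.run):
    """Run `dvc push` first, then commit the metafiles it updated."""
    commands = [
        ["dvc", "push"],
        ["git", "add", *metafiles],
        ["git", "commit", "-m", message],
    ]
    for command in commands:
        # check=True stops the sequence as soon as a step fails, so a
        # failed push is never followed by a commit.
        run(command, check=True)
    return commands
```

<p>Calling <code>push_then_commit(["model.pt.dvc", "predictions.dvc"], "added and pushed model and predictions")</code> inside the repository would then mirror the manual steps in the right order.</p>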
<h1 id="quickstart" style="position:relative;">Quickstart<a href="#quickstart" aria-label="quickstart permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>You can start with DVC cloud versioning in 3 steps:</p>
<p><strong>1. Check whether cloud versioning is enabled for your bucket/storage account,
and enable it if it's not.</strong></p>
<ul>
<li><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/manage-versioning-examples.html" target="_blank" rel="nofollow noopener noreferrer">Amazon S3</a></li>
<li><a href="https://learn.microsoft.com/en-us/azure/storage/blobs/versioning-enable" target="_blank" rel="nofollow noopener noreferrer">Azure Storage</a></li>
<li><a href="https://cloud.google.com/storage/docs/using-object-versioning" target="_blank" rel="nofollow noopener noreferrer">Google Cloud Storage</a></li>
</ul>
<p><strong>2. Set up DVC to use that bucket/container as cloud-versioned remote
storage.</strong></p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc init</span>
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">--default</span> myremote s3://cloud-versioned-bucket/path
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> myremote version_aware <span class="token boolean">true</span></span></code></pre></div>
<p><strong>3. Add and then push data.</strong></p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc add</span> model.pt predictions
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span></span></code></pre></div>
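<p>For Amazon S3, step 1 can also be checked from Python. The sketch below assumes a boto3-style S3 client is passed in; <code>ensure_versioning</code> is a hypothetical helper, while <code>get_bucket_versioning</code> and <code>put_bucket_versioning</code> are the client methods it relies on. Azure and Google Cloud Storage have their own equivalents, described in the links above.</p>

```python
def ensure_versioning(s3_client, bucket):
    """Enable S3 bucket versioning unless it is already enabled.

    Returns True if this call had to enable it, False otherwise.
    """
    # "Status" is absent from the response when versioning was never set.
    status = s3_client.get_bucket_versioning(Bucket=bucket).get("Status")
    if status == "Enabled":
        return False
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    return True
```

<p>With boto3 installed and credentials configured, this could be called as <code>ensure_versioning(boto3.client("s3"), "cloud-versioned-bucket")</code>. Enabling versioning requires the corresponding bucket permissions.</p>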
<hr>
<p>Stop messing around with backing up your cloud data! With cloud versioning in
DVC, you can iterate on your data as much as you want without losing track of
your changes or worrying about your storage growing into an unmanageable mess.</p>
<p>Special thanks to <a href="https://github.com/pmrowla" target="_blank" rel="nofollow noopener noreferrer">Peter Rowlands</a> for leading the
development of this new capability!</p>https://dvc.org/blog/dvclive-metrics-studiohttps://dvc.org/blog/dvclive-metrics-studioMon, 13 Feb 2023 00:00:00 GMT<p>Computer vision is a complex field requiring much experimentation and trial and
error to achieve optimal results. However, managing and tracking the progress of
these experiments has not been easy. Once you've sent a job to the server for
training, you can no longer see what it is doing, and because training often
runs for days, it is easy to miss something along the way. This makes it
difficult to manage your time effectively and avoid unnecessary resource use.
Moreover, a team working on the same project needs to be able to easily share
results with colleagues, which can be challenging with existing (or
non-existent) tooling.</p>
<p>That's where DVCLive and Iterative Studio come in. These tools offer live
experiment tracking and efficient result sharing, making it easy to optimize
your experimentation process and streamline the workflow with your team.</p>
<p><img src="https://dvc.org/2023-02-13/live_plots-d71c91466267bddf7bf4fcd3598eaee6.gif" alt="Real-time experiment tracking in Iterative Studio">
<em>See experiment results in real-time in Iterative Studio</em></p>
<h3 id="the-tools-at-work" style="position:relative;">The tools at work<a href="#the-tools-at-work" aria-label="the tools at work permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a> is a Python library that works with DVC. It
provides a real-time experiment logger that lets machine learning engineers
track the metrics and parameters of their experiments. It is especially useful
for long-running experiments, which can take hours or even days to complete.</p>
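<p>As a rough sketch of what such logging looks like, the loop below reports a parameter and a per-step metric through a DVCLive-style logger. The <code>train</code> function and its halving "loss" are placeholders invented for this example; <code>log_param</code>, <code>log_metric</code>, and <code>next_step</code> are methods of DVCLive's <code>Live</code> object.</p>

```python
def train(live, epochs):
    """Toy training loop that reports progress through a DVCLive-style logger."""
    live.log_param("epochs", epochs)
    loss = 1.0
    for _ in range(epochs):
        loss *= 0.5  # stand-in for a real training step
        live.log_metric("train/loss", loss)
        live.next_step()  # mark the end of this step
    return loss
```

<p>With DVCLive installed, the function could be driven with <code>from dvclive import Live</code> followed by <code>with Live() as live: train(live, epochs=3)</code>. Because the metrics are written at every step, progress is visible while the run is still going.</p>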
<p><a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a> is a
<a href="https://en.wikipedia.org/wiki/Software_as_a_service" target="_blank" rel="nofollow noopener noreferrer">SaaS</a> platform that
displays logged experiments with their metrics, parameters, and plots all tied
together and tracked using DVC and Git under the hood. It allows for rich,
visual, real-time tracking and sharing of the results, making it easier to
collaborate with others and get models ready for production.</p>
<p><img src="https://dvc.org/2023-02-13/live_metrics-da579bff70aae9578c94bf2843a92139.gif" alt="Real-time, nested experiment tracking in Iterative Studio">
<em>Real-time, nested experiment tracking in Iterative Studio</em></p>
<h3 id="use-case-identifying-and-segmenting-pools-from-satellite-imagery" style="position:relative;">Use case: Identifying and segmenting pools from satellite imagery<a href="#use-case-identifying-and-segmenting-pools-from-satellite-imagery" aria-label="use case identifying and segmenting pools from satellite imagery permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In this computer vision project (see repo
<a href="https://github.com/iterative/example-get-started-experiments" target="_blank" rel="nofollow noopener noreferrer">here</a>), we embark
on an exciting journey to uncover swimming pools, often obscured from
street-level views, right in the middle of our neighborhoods and cities. Using
<a href="https://www.mathworks.com/help/deeplearning/ref/resnet18.html" target="_blank" rel="nofollow noopener noreferrer">ResNet-18</a> and
<a href="https://www.fast.ai/" target="_blank" rel="nofollow noopener noreferrer">Fast.ai</a>, we will be able to accurately identify and
segment pools from satellite images.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d2c6e98d28a4eb4add6f657cf363e5a3/39600/bh-pools-dataset.png" alt="BH-Pools Dataset" title="BH-Pools Dataset" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Images and ground truth segmentation of BH-Pools Dataset
(<a href="http://patreo.dcc.ufmg.br/2020/07/29/bh-pools-watertanks-datasets/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<admon type="info">
<p>It's worth noting that the experiment in this example is beyond a toy project by
design. It may take around one hour to run on an ordinary laptop, and the time
may vary depending on the specific configuration and settings. However, you can
use a GPU to speed up the process.</p>
</admon>
<h3 id="dataset-methods--tools" style="position:relative;">Dataset, Methods & Tools<a href="#dataset-methods--tools" aria-label="dataset methods tools permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We will use a modified version of the
<a href="http://patreo.dcc.ufmg.br/2020/07/29/bh-pools-watertanks-datasets/" target="_blank" rel="nofollow noopener noreferrer">BH-Pools dataset</a>,
which consists of high-resolution 4K images of various neighborhoods in the city
of Belo Horizonte, Brazil. These images were captured through Google Earth Pro
and come pre-annotated with swimming pools and water tanks. For this project, we
will focus on just the swimming pools.</p>
<p>We have made the dataset more manageable with some pre-processing to crop the
images into smaller tiles of 1024x1024 pixels.</p>
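<p>The tiling arithmetic can be sketched as follows. This is not the script used to prepare the dataset: the <code>tile_boxes</code> helper, the 3840x2160 frame size, and the choice to drop partial tiles at the edges are all assumptions made for illustration.</p>

```python
def tile_boxes(width, height, tile=1024):
    """Return (left, upper, right, lower) crop boxes for full tiles only."""
    return [
        (left, upper, left + tile, upper + tile)
        for upper in range(0, height - tile + 1, tile)
        for left in range(0, width - tile + 1, tile)
    ]


# Assumed 4K UHD frame size; partial tiles at the right and bottom are dropped.
boxes = tile_boxes(3840, 2160)
```

<p>Under these assumptions each image yields a 3 by 2 grid of tiles. The boxes use the (left, upper, right, lower) convention, so each one can be passed directly to Pillow's <code>Image.crop</code>.</p>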
<p>Using DVCLive together with Iterative Studio, we will be able to see the
progress of our experiments. Let’s get started!</p>
<h3 id="getting-set-up" style="position:relative;">Getting Set up<a href="#getting-set-up" aria-label="getting set up permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Follow the initial setup instructions in the
<a href="https://github.com/iterative/example-get-started-experiments" target="_blank" rel="nofollow noopener noreferrer">README</a>. Next, we
need to run <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> in our root directory to fetch the dataset from our
remote. This command retrieves the data from the remote storage and makes it
available locally for our experiments. Once the download is complete, we will
create a data loader using the label function with <code>SegmentationDataLoaders</code>
from the <code>fastai</code> library. This data loader allows us to easily load and
preprocess the images (e.g. resizing the images to the desired resolution). You
can dig deeper into the code
<a href="https://github.com/iterative/example-get-started-experiments/blob/main/src/train.py#:~:text=/%20%22train_data%22-,data_loader%20%3D%20SegmentationDataLoaders.from_label_func(,),-model_names%20%3D%20%5B" target="_blank" rel="nofollow noopener noreferrer">here.</a></p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bcb76ec2e14257ad23b0dd23be082271/39600/swimming-pools-dataset.png" alt="BH-Pools Dataset" title="BH-Pools Dataset" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Sample of Belo Horizonte Pools Dataset from <code>data_loader</code>
(<a href="http://patreo.dcc.ufmg.br/2020/07/29/bh-pools-watertanks-datasets/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<p>After creating the data loader and resizing the images, we train a ResNet-18
model with <code>unet_learner</code>, varying the hyperparameters and using the
<code>DVCLiveCallback</code>. The <code>DVCLiveCallback</code> is a built-in logger provided by DVCLive
that tracks the intermediate results of the training process, such as the loss
and the <code>DiceMulti</code> metric, in real time. By logging these metrics, we can
monitor the progress of our model and adjust the training as needed to improve
its performance.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">learn <span class="token operator">=</span> unet_learner<span class="token punctuation">(</span>
    data_loader<span class="token punctuation">,</span> arch<span class="token operator">=</span><span class="token builtin">getattr</span><span class="token punctuation">(</span>models<span class="token punctuation">,</span> params<span class="token punctuation">.</span>train<span class="token punctuation">.</span>arch<span class="token punctuation">)</span><span class="token punctuation">,</span> metrics<span class="token operator">=</span>DiceMulti
<span class="token punctuation">)</span>
learn<span class="token punctuation">.</span>fine_tune<span class="token punctuation">(</span>
    <span class="token operator">**</span>params<span class="token punctuation">.</span>train<span class="token punctuation">.</span>fine_tune_args<span class="token punctuation">,</span>
    cbs<span class="token operator">=</span><span class="token punctuation">[</span>DVCLiveCallback<span class="token punctuation">(</span><span class="token builtin">dir</span><span class="token operator">=</span><span class="token string">"results/train"</span><span class="token punctuation">,</span> report<span class="token operator">=</span><span class="token string">"md"</span><span class="token punctuation">,</span> dvcyaml<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">,</span>
<span class="token punctuation">)</span></code></pre></div>
<p>We can also use Studio to analyze and visualize the results of our
experiments, making it easy to share and collaborate with others.
<a href="https://dvc.org/doc/studio/user-guide/projects-and-experiments/live-metrics-and-plots" target="_blank" rel="nofollow noopener noreferrer">Once we provide the STUDIO_TOKEN</a>,
DVCLive automatically posts the results of the experiment to Studio. To do
this, let’s first obtain an individual token from the user profile page in
Studio.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/72a87ac1955f69ffe2ff18effb77181c/39600/studio-access-token.png" alt="Generate Iterative Studio Access token" title="Generate Iterative Studio Access token" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Generating Studio Access Token in the Iterative Studio Profile page
(<a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<p>With this token set as an environment variable, we can access the results of
our experiments in an
<a href="https://dvc.org/doc/studio/get-started" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio project</a>. The project
lets us compare them with previous experiments, find insights to improve our
model, and share it with others.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/05b43930abaa033ef494a947a7cad639/39600/iterative-studio-live-metrics.png" alt="Comparison in Iterative Studio" title="Comparison in Iterative Studio" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Compare with previous experiments in Iterative Studio
(<a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<p>To export the token, run the command below with the token obtained from your
Studio profile:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token builtin class-name">export</span> <span class="token assign-left variable">STUDIO_TOKEN</span><span class="token operator">=</span><span class="token operator"><</span>your-token<span class="token operator">></span></code></pre></div>
<p>Running an experiment locally using DVC will now automatically live-update the
Studio project(s) associated with your Git remote (the one named "origin").</p>
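<p>Since a missing token is easy to overlook, a launcher script can check for it before starting a long run. This is a small sketch; <code>require_studio_token</code> is a hypothetical helper, and only the <code>STUDIO_TOKEN</code> variable name comes from the setup above.</p>

```python
import os


def require_studio_token(env=os.environ):
    """Return the Studio token, or fail early if it is not set."""
    token = env.get("STUDIO_TOKEN", "").strip()
    if not token:
        raise RuntimeError("STUDIO_TOKEN is not set; export it before running")
    return token
```

<p>Failing before training starts is cheaper than discovering hours later that no live updates were sent.</p>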
<p>You may want to change the parameters and run the experiment again.</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">dvc exp run <span class="token parameter variable">-S</span> <span class="token assign-left variable">train.fine_tune_args.epochs</span><span class="token operator">=</span><span class="token number">16</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">train.img_size</span><span class="token operator">=</span><span class="token number">512</span></code></pre></div>
<p><img src="https://dvc.org/2023-02-13/exp-run-ed8ebb8ac1c5606d7490f8eeee498460.gif" alt="Experiment tracking in Iterative Studio" title="=800">
<em>Real-time Experiment tracking in Iterative Studio
(<a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<p>As you can see, changing the number of epochs and the image size improved
the metrics.</p>
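<p>For intuition about what the two <code>-S</code> overrides above amount to, here is a
minimal sketch that applies dotted <code>key=value</code> assignments to a nested
parameters dictionary. It is a simplified illustration, not DVC's actual
implementation (DVC's real override handling is richer); the helper name
<code>apply_override</code> and the starting parameter values are hypothetical.</p>

```python
import copy


def apply_override(params, assignment):
    """Return a copy of params with one dotted 'a.b.c=value' assignment applied."""
    key_path, raw = assignment.split("=", 1)
    *parents, leaf = key_path.split(".")
    updated = copy.deepcopy(params)
    node = updated
    for key in parents:
        # Walk down the nested dict, creating levels that do not exist yet.
        node = node.setdefault(key, {})
    # Only integers are converted here; everything else stays a string.
    node[leaf] = int(raw) if raw.lstrip("-").isdigit() else raw
    return updated


# Hypothetical starting values; the real ones live in the project's params file.
params = {"train": {"img_size": 256, "fine_tune_args": {"epochs": 8}}}
for assignment in ("train.fine_tune_args.epochs=16", "train.img_size=512"):
    params = apply_override(params, assignment)
```

<p>After both assignments, only the two targeted leaves differ from the starting
dictionary, which is why each run can be compared against the previous one
parameter by parameter.</p>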
<p>It's safe to say that if you provide the model with a satellite image of any
neighborhood, it will pretty accurately identify all swimming pools in that
image! And by using DVCLive and Studio, we were able to track and efficiently
control the model training process, without squandering expensive training
resources on unfruitful training runs.</p>
<h3 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Our work has produced a model which is able to accurately identify and segment
swimming pools from satellite images! With the help of DVCLive and Iterative
Studio, we've been able to visualize results in real-time to make
resource-saving decisions. And finally, this work is readily visible for the
entire team to review!</p>
<p>We’d like to express our gratitude to the creators of the incredible
<a href="http://patreo.dcc.ufmg.br/about-us/" target="_blank" rel="nofollow noopener noreferrer">BH-Pools dataset</a>, without which there
would have been less fun and less impressive results!</p>
<p>You can give Iterative Studio a try by signing up
<a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">here</a>. Try out
the <a href="https://github.com/iterative/example-get-started-experiments" target="_blank" rel="nofollow noopener noreferrer">repo</a>
or <a href="https://colab.research.google.com/drive/1NTivljRYiySMJn-SHeWQSycBmSOVUbvA" target="_blank" rel="nofollow noopener noreferrer">colab notebook</a>
for this project and let us know what you think
in <a href="https://discordapp.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a> or
<a href="https://discuss.dvc.org/t/track-computer-vision-experiments-in-real-time-with-dvclive-in-iterative-studio/1478" target="_blank" rel="nofollow noopener noreferrer">Discourse</a>!</p>
<admon type="info">
<p>Learn more about enhancing your machine learning experimentation with these blog
posts:</p>
<ul>
<li><a href="https://iterative.ai/blog/exp-tracking-dvc-python" target="_blank" rel="nofollow noopener noreferrer">Experiment Tracking with DVC and Python</a></li>
<li><a href="https://iterative.ai/blog/dvc-hydra-integration/" target="_blank" rel="nofollow noopener noreferrer">DVC and Hydra Integration</a></li>
</ul>
</admon>
<hr>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/mlem-nanogpt-modal-flyiohttps://dvc.org/blog/mlem-nanogpt-modal-flyioWed, 08 Feb 2023 00:00:00 GMT<h2 id="preparing-data" style="position:relative;">Preparing data<a href="#preparing-data" aria-label="preparing data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>To kick off the process, you basically just need a single text file that you
want your model to be trained on. For example, I often struggle with writing
docs for the MLEM framework, so I will try to generate those.
<a href="https://github.com/mike0sv/nanoGPT/blob/mlem/data/mlem-docs/prepare.py" target="_blank" rel="nofollow noopener noreferrer">Here</a>
you can find my code that clones the
<a href="https://github.com/iterative/mlem.ai" target="_blank" rel="nofollow noopener noreferrer">mlem.ai repo</a>, compiles every <code>.md</code> file from
the docs directory into a single text file, and then creates a training set using
the same code as the example Shakespeare dataset. I also prepended each file’s
content with the path to that file, so I can condition the generation on a
specific file.</p>
<p>Of course, for your own experiments, you can provide different data and train
a GPT model for a different task.</p>
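<p>As a rough illustration of the preparation step described above, the sketch
below concatenates Markdown files with their relative paths prepended and then
applies a character-level encoding with a train/validation split. It is not the
linked <code>prepare.py</code>; the function names, the 90/10 split, and the tiny
temporary docs tree are assumptions made for the example.</p>

```python
import tempfile
from pathlib import Path


def compile_docs(docs_dir: Path) -> str:
    """Join every .md file under docs_dir, each prefixed with its relative path."""
    parts = []
    for path in sorted(docs_dir.rglob("*.md")):
        rel = path.relative_to(docs_dir).as_posix()
        # The path line lets a prompt that starts with a path steer generation.
        parts.append(f"{rel}\n{path.read_text(encoding='utf-8')}\n")
    return "\n".join(parts)


def char_encode(text: str, val_fraction: float = 0.1):
    """Map each character to an integer id and split into train/val id lists."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    ids = [stoi[ch] for ch in text]
    split = int(len(ids) * (1 - val_fraction))
    return ids[:split], ids[split:], stoi


def demo():
    """Build a tiny, made-up docs tree and run both steps on it."""
    with tempfile.TemporaryDirectory() as tmp:
        docs = Path(tmp)
        (docs / "get-started").mkdir()
        (docs / "index.md").write_text("# MLEM\n", encoding="utf-8")
        (docs / "get-started" / "saving.md").write_text("Save models.\n", encoding="utf-8")
        text = compile_docs(docs)
        train_ids, val_ids, stoi = char_encode(text)
        return text, train_ids, val_ids, stoi
```

<p>Calling <code>demo()</code> returns the compiled text together with the encoded
splits, so you can inspect how the path prefixes end up in the training data.</p>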
<h2 id="training-the-model" style="position:relative;">Training the model<a href="#training-the-model" aria-label="training the model permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Thanks to Andrej’s original repo, it’s as easy as cloning and running a couple
of commands. My fork has some additional stuff to make it even easier.</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token function">git</span> clone https://github.com/mike0sv/nanoGPT <span class="token operator">&&</span> <span class="token builtin class-name">cd</span> nanoGPT/ <span class="token operator">&&</span> <span class="token function">git</span> checkout <span class="token parameter variable">-b</span> mlem origin/mlem
$ pip <span class="token function">install</span> <span class="token parameter variable">-r</span> requirements-mlem.txt
<span class="token comment"># Prepare mlem docs dataset</span>
<span class="token comment"># Alternatively, you can compile your own training data for a different task</span>
$ python data/mlem-docs/prepare.py char</code></pre></div>
<p>If you don’t have access to a GPU, you can use <a href="http://modal.com" target="_blank" rel="nofollow noopener noreferrer">modal.com</a> to
train your model without any infrastructure configuration. Just register there,
wait for approval, and run
<a href="https://github.com/mike0sv/nanoGPT/blob/mlem/modal_train.py" target="_blank" rel="nofollow noopener noreferrer">this script</a> to
train the model and download the resulting checkpoint.</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ modal token new <span class="token comment"># approve in browser</span>
$ python modal_train.py <span class="token comment"># you can edit paths or other parameters</span></code></pre></div>
<p>Or, if you are already working on a machine with a GPU, just run the training
locally:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token comment"># train model</span>
$ python train.py config/train_mlemai.py <span class="token parameter variable">--device</span><span class="token operator">=</span>cuda <span class="token parameter variable">--dtype</span><span class="token operator">=</span>float32 <span class="token parameter variable">--max_iters</span><span class="token operator">=</span><span class="token number">3000</span> <span class="token parameter variable">--init_from</span><span class="token operator">=</span>scratch</code></pre></div>
<p>After training, your model will be saved at <code>out-mlemai-char/ckpt.pt</code>, and you can
sample from it with:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token comment"># sample model</span>
$ python sample.py <span class="token parameter variable">--out_dir</span><span class="token operator">=</span>out-mlemai-char <span class="token parameter variable">--dtype</span><span class="token operator">=</span>float32</code></pre></div>
<h2 id="deploying-your-model" style="position:relative;">Deploying your model<a href="#deploying-your-model" aria-label="deploying your model permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Now, to show off your model to friends and colleagues, we will deploy it as a
<a href="https://streamlit.io" target="_blank" rel="nofollow noopener noreferrer">Streamlit</a> application to <a href="https://fly.io" target="_blank" rel="nofollow noopener noreferrer">https://fly.io</a>. It’s very easy
with the <a href="https://mlem.ai" target="_blank" rel="nofollow noopener noreferrer">MLEM</a> Streamlit extension. First, we need to save the
model as an MLEM model;
<a href="https://github.com/mike0sv/nanoGPT/blob/mlem/wrapper.py" target="_blank" rel="nofollow noopener noreferrer">here</a> is the script
for that:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ python wrapper.py out-mlemai-char mlem_char</code></pre></div>
<p>Now, set up and log in to <a href="https://fly.io/docs/hands-on/install-flyctl/" target="_blank" rel="nofollow noopener noreferrer">fly.io</a>
and run the <code>mlem deploy</code> command. I also prepared a
<a href="https://github.com/mike0sv/nanoGPT/blob/mlem/app.py" target="_blank" rel="nofollow noopener noreferrer">custom Streamlit application template</a>
you can use to give it a more ChatGPT-like feel:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">
<span class="token comment"># setup flyio</span>
$ flyctl auth login
$ mlem deploy run flyio app <span class="token parameter variable">-m</span> mlem_char <span class="token punctuation">\</span>
<span class="token parameter variable">--app_name</span> mlem-nanogpt <span class="token parameter variable">--scale_memory</span> <span class="token number">1024</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--server</span> streamlit <span class="token parameter variable">--server.ui_port</span> <span class="token number">8080</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--server.server_port</span> <span class="token number">8081</span> <span class="token parameter variable">--server.template</span> app.py</code></pre></div>
<p>After the command finishes, just go to https://&lt;app_name&gt;.fly.dev (in my case
it's <a href="https://mlem-nanogpt.fly.dev/" target="_blank" rel="nofollow noopener noreferrer">https://mlem-nanogpt.fly.dev/</a>) and start
chatting.</p>
<p><img src="https://dvc.org/2023-02-08/app-03314cb2a611e772a98a57b05f8e5a77.gif" alt="app.gif"></p>
<p>Well, I guess if this is what generated docs look like, I still have a job! 🤣</p>
<p>But just for lulz, I re-generated the whole MLEM documentation with this model -
you can check it out
<a href="https://mlem-ai-nano-gpt-xyinoh8xgobdz.herokuapp.com/doc" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Nowadays, it’s really easy to recreate someone else’s work thanks to open source
software. And thanks to folks like Andrej and companies like Modal and Fly, it
is now much faster to build and deploy ML models. We are happy to be part of
this, with tools like MLEM, DVC, CML and others. Long live the open source!</p>https://dvc.org/blog/mlem-cv-model-deploymenthttps://dvc.org/blog/mlem-cv-model-deploymentThu, 19 Jan 2023 00:00:00 GMT<p>By developing MLEM - a tool that allows researchers to easily deploy their
models to production without having to worry about the underlying
infrastructure, we strive to help them focus on what they do best: developing
and improving their models. This can help accelerate the pace of research and
development, and ultimately lead to better and more effective AI systems.</p>
<p>MLEM deploys your models in a couple of commands, and in this blog post, we’ll
deploy an image classification model to <a href="https://fly.io" target="_blank" rel="nofollow noopener noreferrer">Fly.io</a>. Without any
additional user input, MLEM will serve your model with a REST API, create a
Streamlit application, and build a Docker image with both included. Does this
sound like fun? Try out the deployment at <a href="https://mlem-cv.fly.dev" target="_blank" rel="nofollow noopener noreferrer">https://mlem-cv.fly.dev</a> before we
start!</p>
<h2 id="the-good-part" style="position:relative;">The good part<a href="#the-good-part" aria-label="the good part permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>To showcase MLEM's power, we’ll take a PyTorch model and deploy it to the cloud in
a couple of simple steps. Just don’t forget to install MLEM and other
requirements with <code>pip install torch torchvision mlem[streamlit,flyio]</code>. You’ll
also need Docker up and running on your machine.</p>
<p>First, we need to get the model. To get to model deployment faster, we won’t
dive too far into model development and stick to the task at hand by using a
pre-trained ResNet model from <code>torchvision</code>:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> torchvision<span class="token punctuation">.</span>models <span class="token keyword">import</span> ResNet50_Weights<span class="token punctuation">,</span> resnet50
weights <span class="token operator">=</span> ResNet50_Weights<span class="token punctuation">.</span>DEFAULT
model <span class="token operator">=</span> resnet50<span class="token punctuation">(</span>weights<span class="token operator">=</span>weights<span class="token punctuation">)</span>
model<span class="token punctuation">.</span><span class="token builtin">eval</span><span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div>
<p>Since our model expects tensors of a certain shape, we need some preprocessing
to be able to use it with an arbitrary image. And while we’re here, let’s throw
some postprocessing on top to get the class name from the predicted class probabilities.
Thankfully, MLEM allows you to do just that:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> torchvision<span class="token punctuation">.</span>io <span class="token keyword">import</span> read_image
<span class="token keyword">from</span> mlem<span class="token punctuation">.</span>api <span class="token keyword">import</span> save
img <span class="token operator">=</span> read_image<span class="token punctuation">(</span><span class="token string">"cat.jpg"</span><span class="token punctuation">)</span>
categories <span class="token operator">=</span> weights<span class="token punctuation">.</span>meta<span class="token punctuation">[</span><span class="token string">"categories"</span><span class="token punctuation">]</span>
preprocess <span class="token operator">=</span> weights<span class="token punctuation">.</span>transforms<span class="token punctuation">(</span><span class="token punctuation">)</span>
save<span class="token punctuation">(</span>model<span class="token punctuation">,</span> <span class="token string">"torch_resnet"</span><span class="token punctuation">,</span>
    preprocess<span class="token operator">=</span><span class="token keyword">lambda</span> x<span class="token punctuation">:</span> preprocess<span class="token punctuation">(</span>x<span class="token punctuation">)</span><span class="token punctuation">.</span>unsqueeze<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
    postprocess<span class="token operator">=</span><span class="token keyword">lambda</span> x<span class="token punctuation">:</span> categories<span class="token punctuation">[</span>
        x<span class="token punctuation">.</span>squeeze<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">.</span>softmax<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">.</span>argmax<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>item<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token punctuation">]</span><span class="token punctuation">,</span>
    sample_data<span class="token operator">=</span>img<span class="token punctuation">,</span>
<span class="token punctuation">)</span></code></pre></div>
<p>MLEM will do its metadata-extracting magic on our model, so we get a
ready-to-serve MLEM model at the <code>torch_resnet</code> path.</p>
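<p>To make the postprocessing step concrete, here is a small dependency-free
sketch of the same idea: turn raw model scores into probabilities with softmax,
take the index of the largest one, and look up its name. It mirrors the
<code>softmax(0).argmax()</code> chain in the lambda but uses plain Python lists
instead of tensors; the three category names and the scores are made up for the
example.</p>

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, categories):
    """Return the category name with the highest predicted probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return categories[best]


# Hypothetical labels and scores; the real model has 1000 ImageNet categories.
categories = ["tabby", "golden retriever", "goldfish"]
logits = [2.0, 0.5, -1.0]
label = postprocess(logits, categories)
```

<p>Because softmax preserves the ordering of the scores, the winning index is the
same as taking the argmax of the raw scores directly; the softmax is only needed
if you also want to report the probability.</p>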
<p>Now we’re ready for deployment, but before that, we’d like to play around with the
model locally. We can use <a href="https://mlem.ai/doc/command-reference/serve" target="_blank" rel="nofollow noopener noreferrer"><code>mlem serve</code></a>
to see how it works:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ mlem serve streamlit <span class="token punctuation">\</span>
<span class="token parameter variable">--model</span> torch_resnet <span class="token punctuation">\</span>
<span class="token parameter variable">--request_serializer</span> torch_image <span class="token comment"># accept images instead of raw tensors</span>
Starting streamlit server<span class="token punctuation">..</span>.
🖇️ Adding route <span class="token keyword">for</span> /predict
Checkout openapi docs at <span class="token operator"><</span>http://0.0.0.0:8080/docs<span class="token operator">></span>
INFO: Started server process <span class="token punctuation">[</span><span class="token number">17525</span><span class="token punctuation">]</span>
INFO: Waiting <span class="token keyword">for</span> application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 <span class="token punctuation">(</span>Press CTRL+C to quit<span class="token punctuation">)</span>
You can now view your Streamlit app <span class="token keyword">in</span> your browser.
URL: http://0.0.0.0:80</code></pre></div>
<p>Let's head over to <a href="http://localhost:80" target="_blank" rel="nofollow noopener noreferrer">localhost:80</a> to see if our model is
ready for production!</p>
<p><img src="https://dvc.org/2023-01-19/streamlit-1fd30393f4cbab125953036101ec878f.gif" alt="Streamlit app"></p>
<p>This is already useful: you can play around with your model, demo it to
colleagues in a call, or show your pet how it's going to be classified now. There are tons
of ways to use this, so give it a try the next time you need it!</p>
<h2 id="cloudification" style="position:relative;">Cloudification<a href="#cloudification" aria-label="cloudification permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>That's cool and all, but what is your model worth if you need to call your
friends each time to show it off? MLEM can help in this department too.
<a href="https://mlem.ai/doc/command-reference/deployment" target="_blank" rel="nofollow noopener noreferrer">Using <code>mlem deploy</code></a>, you can
deploy your model to Heroku, SageMaker, Kubernetes, or Fly.io (not to mention
<a href="https://mlem.ai/doc/command-reference/build" target="_blank" rel="nofollow noopener noreferrer"><code>mlem build</code></a>, which can build a
Docker image out of your model that you can later deploy yourself).</p>
<p>Since a PR for <a href="http://fly.io" target="_blank" rel="nofollow noopener noreferrer">fly.io</a> was just merged, let’s use it:</p>
<ul>
<li>Go to <a href="http://fly.io" target="_blank" rel="nofollow noopener noreferrer">fly.io</a> and set up an account</li>
<li>Install flyctl using
<a href="https://fly.io/docs/hands-on/install-flyctl/" target="_blank" rel="nofollow noopener noreferrer">these instructions</a></li>
<li>Log in via <code>flyctl auth login</code></li>
<li>You also need to provide a credit card, but they won't charge you
<a href="https://fly.io/docs/about/pricing/#how-it-works" target="_blank" rel="nofollow noopener noreferrer">until you exceed free limits</a>.</li>
</ul>
<p>Normally, we’d need to write a <code>Dockerfile</code>, a <code>requirements.txt</code>, and other
deployment-platform-specific files like a <code>Procfile</code>, and then finally use the
<code>flyctl</code> executable to run an app. But fortunately, we can just run:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ mlem deploy run flyio cv-app <span class="token punctuation">\</span>
<span class="token parameter variable">--model</span> torch_resnet <span class="token punctuation">\</span>
<span class="token parameter variable">--app_name</span> mlem-cv <span class="token punctuation">\</span>
<span class="token parameter variable">--scale_memory</span> <span class="token number">1024</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--server</span> streamlit <span class="token punctuation">\</span>
<span class="token parameter variable">--server.request_serializer</span> torch_image <span class="token punctuation">\</span>
<span class="token parameter variable">--server.ui_port</span> <span class="token number">8080</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--server.server_port</span> <span class="token number">8081</span></code></pre></div>
<p>Now it’s live at <a href="https://mlem-cv.fly.dev" target="_blank" rel="nofollow noopener noreferrer">mlem-cv.fly.dev</a> 🚀</p>
<p>Finally, all you have to do now is to brag to your best friend about your
achievement:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7c92549a130c055a72fbe0829ae7cf58/39600/best-friend.png" alt="ChatGPT" title="ChatGPT" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h2 id="whats-next" style="position:relative;">What's next?<a href="#whats-next" aria-label="whats next permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>As we promised in our
<a href="https://iterative.ai/blog/mlem-k8s-sagemaker/" target="_blank" rel="nofollow noopener noreferrer">last MLEM blog post</a>, we added
support for CV models and models that have preprocessing or postprocessing
steps. What's next?</p>
<ul>
<li>We're looking at integrations with specialized CV serving tools like
TorchServe, GPU support, and model optimization.</li>
<li>We already
<a href="https://medium.com/better-programming/i-trained-a-model-to-tell-if-you-were-naughty-this-year-11a36ca6d472" target="_blank" rel="nofollow noopener noreferrer">support NLP scenarios</a>,
but we're going to see if there is something special that needs to be
implemented there as well.</li>
</ul>
<p>Feel free to drop us a line in
<a href="https://github.com/iterative/mlem/issues" target="_blank" rel="nofollow noopener noreferrer">GH issues</a> if you'd like something
specific! See you next time 🐶</p>https://dvc.org/blog/january-2023-heartbeathttps://dvc.org/blog/january-2023-heartbeatTue, 17 Jan 2023 00:00:00 GMT<p>Happy New Year! We are looking forward to what’s going to be a stellar year for
us and for all of you! We are hoping for peace to reign, the recession to
subside, and success aplenty. 🤞🏼 Are you ready? Let’s do this!</p>
<p><img src="https://media.giphy.com/media/JykvbWfXtAHSM/giphy.gif" alt="Lets Do This GIF by National Geographic Channel"></p>
<h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>We always start with DVC, but this month, in this new year, we’ll start with
MLEM! We released MLEM in June of last year and have made
<a href="https://iterative.ai/blog/mlem-k8s-sagemaker" target="_blank" rel="nofollow noopener noreferrer">some advances to it already</a>. It
seems the Community is learning about it and recognizing its benefits. We are
thrilled to see that!</p>
<h2 id="mlem-tutorial-video-from-jcharis-jesse" style="position:relative;">MLEM Tutorial Video from JCharis Jesse<a href="#mlem-tutorial-video-from-jcharis-jesse" aria-label="mlem tutorial video from jcharis jesse permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://twitter.com/JCharisTech" target="_blank" rel="nofollow noopener noreferrer"><strong>JCharis Jesse</strong></a> created the
<a href="https://www.youtube.com/watch?v=vEoc64xJaK4" target="_blank" rel="nofollow noopener noreferrer">FIRST video tutorial from the Community for MLEM!</a>
In this very well-explained and recorded video, Jesse takes you through what
MLEM is and where it fits in the machine learning to production process. He
follows that by showing the different options of saving a model, where to find
the model metadata and how it works, loading the ML model, examples of serving
with FastAPI and Docker, and finally applying the model to data for prediction.
If you are interested in using MLEM for serving your models, this will
definitely help get you started! You can find a ton of other great content on
his <a href="https://www.youtube.com/@JCharisTech" target="_blank" rel="nofollow noopener noreferrer">YouTube site</a>.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/vEoc64xJaK4?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="tryolabs-top-python-libraries-of-2022" style="position:relative;">Tryolabs Top Python Libraries of 2022<a href="#tryolabs-top-python-libraries-of-2022" aria-label="tryolabs top python libraries of 2022 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>From our friends at <a href="https://tryolabs.com/" target="_blank" rel="nofollow noopener noreferrer">Tryolabs</a>,
<a href="https://www.linkedin.com/in/alan-descoins/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alan Descoins</strong></a> and
<a href="https://www.linkedin.com/in/facundo-lezama/" target="_blank" rel="nofollow noopener noreferrer"><strong>Facundo Lezama</strong></a> round out 2022
with
<a href="https://tryolabs.com/blog/2022/12/26/top-python-libraries-2022" target="_blank" rel="nofollow noopener noreferrer">Tryolabs’ annual picks for the best Python Libraries of 2022</a>.
To make the cut, a library must have been launched or gained
popularity within the year. Their top 10 picks are well worth
a look, including <a href="https://lineapy.org/" target="_blank" rel="nofollow noopener noreferrer">LineaPy</a>, which helps you
convert notebooks to production pipelines. MLEM also made the list in the
category of <em>Tools & Enablers</em>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/fb378179100dbdc49e9db4e80afeb3ac/39600/tryolabs.png" alt="Tryolabs" title="Tryolabs" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Tryolabs Best
Python Libraries of 2022
(<a href="https://tryolabs.com/blog/2022/12/26/top-python-libraries-2022" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="bex-tuychiev---data-version-control-learn-what-other-data-scientists-are-ignoring" style="position:relative;">Bex Tuychiev - Data Version Control: Learn What Other Data Scientists Are Ignoring<a href="#bex-tuychiev---data-version-control-learn-what-other-data-scientists-are-ignoring" aria-label="bex tuychiev data version control learn what other data scientists are ignoring permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b9d24af1b8f8422fd44012394ef91049/03346/fiona-art.jpg" alt="Learn What Other Data Scientists are Ignoring with DVC" title="Photo by Fiona Art from Pexels" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
In the first part of a new series on DVC,
<a href="https://www.linkedin.com/in/bextuychiev/" target="_blank" rel="nofollow noopener noreferrer"><strong>Bex Tuychiev</strong></a> writes a fire 🔥
tutorial on DVC in
<a href="https://towardsdatascience.com/how-to-version-gigabyte-sized-datasets-just-like-code-with-dvc-in-python-5197662e85bd" target="_blank" rel="nofollow noopener noreferrer">Towards Data Science</a>
with a computer vision project using the German Traffic Sign Recognition
Benchmark Dataset and TensorFlow. He guides you through getting the project properly
set up, then shows how to start adding, tracking, pulling, and pushing files with DVC.
Next, he goes over building the image classification model and then concludes
with how to create a shared cache if you are working on a large project with a
team. Reproducibility and Collaboration for the win! We are looking forward to
the next parts of the series!</p>
<p><img src="https://media.giphy.com/media/epxDzItQhxAzK/giphy.gif" alt="It Crowd Popcorn GIF"></p>
<h2 id="aryan-jadon---survey-of-data-versioning-tools-for-machine-learning-operations" style="position:relative;">Aryan Jadon - Survey of Data Versioning Tools for Machine Learning Operations<a href="#aryan-jadon---survey-of-data-versioning-tools-for-machine-learning-operations" aria-label="aryan jadon survey of data versioning tools for machine learning operations permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>For a very nice comparison of Data Versioning Tools, look to
<a href="https://www.linkedin.com/in/aryan-jadon/" target="_blank" rel="nofollow noopener noreferrer"><strong>Aryan Jadon’s</strong></a>
<a href="https://medium.com/@aryanjadon/analysis-of-data-versioning-tools-for-machine-learning-operations-1cb27146ce49" target="_blank" rel="nofollow noopener noreferrer">recent post on the subject</a>.
He seems to hit them all, providing information about their benefits and things
of which to be cautious. Naturally, DVC makes this list with the only caution
being, “you need to use a Git repository to use DVC’s versioning features.”
Isn’t Git a part of every modern tech stack? 😉 We are staying true to our mission to
deliver the best developer experience for machine learning teams by creating an
ecosystem of open, modular ML tools!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ab7a7b82f3683b956095a8b0e40529eb/39600/aryan-jadon.png" alt="Survey of Data Versioning Tools for Machine Learning Operations" title="Survey of Data Versioning Tools for Machine Learning Operations" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Deciding on Data Versioning Tools?
(<a href="https://medium.com/@aryanjadon/analysis-of-data-versioning-tools-for-machine-learning-operations-1cb27146ce49" target="_blank" rel="nofollow noopener noreferrer">Source link by Mary Amato </a>)</em></p>
<h2 id="sami-jawhar---running-parallel-pipelines-with-dvc-and-tpi" style="position:relative;">Sami Jawhar - Running Parallel Pipelines with DVC and TPI<a href="#sami-jawhar---running-parallel-pipelines-with-dvc-and-tpi" aria-label="sami jawhar running parallel pipelines with dvc and tpi permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>If you couldn’t make the December Meetup, good news!
<a href="https://youtu.be/X3M1UfMn2Kk" target="_blank" rel="nofollow noopener noreferrer">The video</a> is already out!
<a href="https://www.linkedin.com/in/sami-jawhar-a58b9849/" target="_blank" rel="nofollow noopener noreferrer"><strong>Sami Jawhar</strong></a> joined us
to share a solution he built to run parallel pipelines with DVC and TPI to save
time processing the massive amount of data they use in their brain research at
<a href="https://www.kernel.com/" target="_blank" rel="nofollow noopener noreferrer">Kernel</a>. He describes the context of his situation as
well as all of its constraints and finally the details of the solution, coined
“Neuromancer” after the famous sci-fi novel. Get ready for some mind-blowing
engineering! 🤯</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/X3M1UfMn2Kk?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h1 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="mlem-christmas-project" style="position:relative;">MLEM Christmas Project<a href="#mlem-christmas-project" aria-label="mlem christmas project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><img src="https://media.giphy.com/media/KtrhyNGwNCSYM4pVRq/giphy.gif" alt="Have you been Naughty or Nice?" title="Naughty or Nice MLEMMing" style="width: 300px; float: right; clear: left; padding: 0.5rem">
In case you missed it while you were out for the holidays,
<a href="https://www.linkedin.com/in/1aguschin/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Guschin</strong></a> and
<a href="https://www.linkedin.com/in/mike0sv/" target="_blank" rel="nofollow noopener noreferrer"><strong>Mike Sveshnikov</strong></a>, your friendly
neighborhood MLEM creators, put together
<a href="https://medium.com/@mike0sv/i-trained-a-model-to-tell-if-you-were-naughty-this-year-11a36ca6d472" target="_blank" rel="nofollow noopener noreferrer">a fun project using MLEM</a>
to determine if you had been naughty or nice just ahead of Santa’s trot around
the globe in 2022. In the blog post, you will learn how they DDoS’ed Santa’s
website, trained a Christmas (decision) tree, and deployed an ML service with
MLEM to <a href="https://streamlit.io/" target="_blank" rel="nofollow noopener noreferrer">Streamlit</a> to see the predictions.</p>
<p>You can try it out <a href="https://mlem-nice-or-naughty.fly.dev/" target="_blank" rel="nofollow noopener noreferrer">here</a>. And check out
how some of our team members fared in
<a href="https://www.linkedin.com/posts/1aguschin_streamlit-activity-7012056418816036864-k9hv?utm_source=share&utm_medium=member_desktop" target="_blank" rel="nofollow noopener noreferrer">this LinkedIn post</a>.
Spoiler alert: I’m naughty and nice?</p>
<h2 id="casper-da-costa-luis-at-mlops-summit---painless-cloud-experiments-without-leaving-your-ide" style="position:relative;">Casper da Costa-Luis at MLOps Summit - Painless cloud experiments without leaving your IDE<a href="#casper-da-costa-luis-at-mlops-summit---painless-cloud-experiments-without-leaving-your-ide" aria-label="casper da costa luis at mlops summit painless cloud experiments without leaving your ide permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Our CML Product Manager,
<a href="https://github.com/casperdcl" target="_blank" rel="nofollow noopener noreferrer"><strong>Casper da Costa-Luis</strong></a>, presented in November
at MLOps Summit on <em>Painless cloud experiments without leaving your IDE</em>. The
presentation is now available on YouTube
<a href="https://www.youtube.com/watch?v=PaBQF89URuI" target="_blank" rel="nofollow noopener noreferrer">here</a>. If full lifecycle
management of computing resources (including GPUs and auto-respawning spot
instances) from several cloud vendors (AWS, Azure, GCP, K8s), without needing
to be a cloud expert, appeals to you, this talk is for you! He discusses how to move
experiments seamlessly between a local laptop, a powerful cloud machine, and
your CI/CD of choice.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/PaBQF89URuI?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="new-unstructured-data-query-language" style="position:relative;">New Unstructured Data Query Language<a href="#new-unstructured-data-query-language" aria-label="new unstructured data query language permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><strong>Do you use Amazon S3, Azure Blob Storage, or Google Cloud Storage? We have a
new solution for finding and managing your datasets of unstructured data like
images, audio files, and PDFs!</strong> Extend your DVC environment with the first
unstructured data query language (think SQL -> DQL) for machine learning. We are
looking for beta customers for this new tool.</p>
<p><a href="https://calendly.com/gtm-2/iterative-datamgmt-overview" target="_blank" rel="nofollow noopener noreferrer">Schedule a meeting with us</a>
if that's what you need! Find more info
<a href="https://iterative.ai/data-catalog-for-ml" target="_blank" rel="nofollow noopener noreferrer">here.</a></p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2edbe3221d465dd67d0de2903ebd6c73/39600/dvc-cloud.png" alt="Unstructured Data Query Language from the makers of DVC" title="Unstructured Data Query Language from the makers of DVC" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Unstructured Data Query Language Prototype</em></p>
<h2 id="-doc-updates" style="position:relative;">✍🏼 Doc Updates!<a href="#-doc-updates" aria-label=" doc updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Our favorite Tweet this month is from
<a href="https://twitter.com/the_osbm" target="_blank" rel="nofollow noopener noreferrer"><strong>Osman Bayram</strong></a> who mentions he plans to use
CML with <a href="https://huggingface.co/" target="_blank" rel="nofollow noopener noreferrer">Hugging Face</a> GPU. We are looking forward to
that! 🍿 I'm seeing a lot of popcorn eating in our future. See you next month!</p>
<p><a href="https://twitter.com/the_osbm/status/1606018332175478786?s=20&t=uTKIsTjTv5frJPz2yNPqUw" target="_blank" rel="nofollow noopener noreferrer">Link to Tweet</a></p>
<hr>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>
<p>https://dvc.org/blog/december-2022-heartbeat, Fri, 16 Dec 2022 00:00:00 GMT</p>
<admon type="tip">
<p>Unlike most of the text you've read over the past two weeks, this Heartbeat was
100% human generated. 😉</p>
</admon>
<p>Welcome to December! Wow, what a year! We introduced an online course, added
five new tools (TPI, GTO, MLEM, DVC Extension for VS Code, and a Model Registry
in Iterative Studio) plus tons of new features to DVC, CML, and Iterative
Studio. We also were thrilled to emerge from the pandemic and meet so many of
you in person at conferences around the world. We are excited about what's in
store for 2023, and we thank you all for being such fantastic community members.
While there are still challenging events happening around the globe, there is
much to be thankful for and victories to celebrate! Bring on 2023!</p>
<p><img src="https://media.giphy.com/media/DEZA7FlHbMesUF1jm9/giphy.gif" alt="Believe Jason Sudeikis GIF by Apple TV"></p>
<h2 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="mlops-guide" style="position:relative;">MLOps Guide<a href="#mlops-guide" aria-label="mlops guide permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>For their engineering final project at <a href="https://www.insper.edu.br/en/" target="_blank" rel="nofollow noopener noreferrer">Insper,</a>
<a href="https://github.com/arthurolga" target="_blank" rel="nofollow noopener noreferrer"><strong>Arthur Olga</strong></a>, <a href="https://github.com/gabriellm1" target="_blank" rel="nofollow noopener noreferrer"><strong>Gabriel Monteiro</strong></a>, <a href="https://github.com/guipleite" target="_blank" rel="nofollow noopener noreferrer"><strong>Guilherme Leite</strong></a>,
and <a href="https://github.com/ViniGl" target="_blank" rel="nofollow noopener noreferrer"><strong>Vinicius Lima</strong></a> created the
<a href="https://mlops-guide.github.io/" target="_blank" rel="nofollow noopener noreferrer">MLOps Guide</a>, which provides a complete MLOps
development cycle using DVC, CML, and IBM Watson. The multi-page guide covers
the principles of MLOps as well as a full tutorial for building an MLOps
environment. It covers data and model versioning, feature management and
storage, automation of pipelines and processes, CI/CD for machine learning, and
continuous monitoring of models. The guide also includes
videos outlining the project and much of the coding, as well as a project
repository that you can work through.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e19b634d5bfb74525bd6fcfebea425b5/39600/DiagramMLOPs.png" alt="MLOps Guide" title="MLOps Guide" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>MLOps
Guide (<a href="https://mlops-guide.github.io/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h3 id="turn-vs-code-into-a-one-stop-shop-for-ml-experiments" style="position:relative;">Turn VS Code Into a One-Stop Shop for ML Experiments<a href="#turn-vs-code-into-a-one-stop-shop-for-ml-experiments" aria-label="turn vs code into a one stop shop for ml experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://www.linkedin.com/in/eryklewinson/" target="_blank" rel="nofollow noopener noreferrer"><strong>Eryk Lewinson</strong></a> wrote a fabulous,
<a href="https://towardsdatascience.com/turn-vs-code-into-a-one-stop-shop-for-ml-experiments-49c97c47db27" target="_blank" rel="nofollow noopener noreferrer">in-depth tutorial</a>
on experiment tracking using our new DVC Extension for VS Code. He starts off
with, “One of the biggest threats to productivity in recent times is context
switching.” As a Community Manager, I can so relate! 😅 He posits that the
extension is a great way to both code our experiments and evaluate and compare
them happily in our IDE, without having to jump back and forth between
platforms.</p>
<p><img src="https://dvc.org/2022-12-16/eryk-lewinson-81e150bc16515d76971e4dfdaa417938.gif" alt="DVC Extension for VS Code Experiment Tracking"></p>
<p>Eryk uses a credit card risk dataset and project to show most of the
capabilities of the
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC Extension for VS Code</a>
and takes us through every step of the workflow and the resulting
project structure. He notes the best points of the extension are its experiment
bookkeeping with an emphasis on reproducibility and its extended plotting
capabilities including live plotting to visualize model performance while the
model is still being trained. He goes over some tricks and functionality of the
extension as well.</p>
<h3 id="a-fable-about-mlopsand-broken-dreams" style="position:relative;">A Fable About MLOps…and Broken Dreams<a href="#a-fable-about-mlopsand-broken-dreams" aria-label="a fable about mlopsand broken dreams permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d4c9a14e797ce0242da41c27d281855f/39600/alex-burlacu.png" alt="A Fable About MLOps...And Broken Dreams" title="A Fable About MLOps...And Broken Dreams" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>A Fable About MLOps… and Broken Dreams
(<a href="https://alexandruburlacu.github.io/posts/2022-11-22-mlops-fable" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<p><a href="https://www.linkedin.com/in/alexandru-burlacu" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Burlacu</strong></a> tells a great
story and provides many tips on his experience in MLOps
<a href="https://alexandruburlacu.github.io/posts/2022-11-22-mlops-fable" target="_blank" rel="nofollow noopener noreferrer">in this piece</a>
on his blog called <em>A Fable About MLOps… and Broken Dreams</em>. The tale is
likely all too familiar to many of you in our Community in addition to being
validating and entertaining to read. He offers some great prerequisites for
beginning your MLOps journey including quickly finding and accessing your data,
seeding that model training code, and recording your experiment configuration.
For the last of these, he recommends MLflow, but as the previous summary from Eryk points
out, this can be done very effectively with the new DVC extension AND be truly
fully reproducible. 🤗</p>
<p>Generally, he recommends starting early and starting small with MLOps. More
technically, he recommends a simple data collection and discovery system, data
versioning with DVC, replicable experiments, experiment tracking, ML serving,
testing, and CI/CD. It's all great advice and fun to read!</p>
<h3 id="ml-pipeline-decoupled---i-managed-to-write-a-framework-agnostic-ml-pipeline-with-dvc-rust-and-python" style="position:relative;">ML Pipeline Decoupled - I managed to write a framework-agnostic ml pipeline with DVC, Rust, and Python<a href="#ml-pipeline-decoupled---i-managed-to-write-a-framework-agnostic-ml-pipeline-with-dvc-rust-and-python" aria-label="ml pipeline decoupled i managed to write a framework agnostic ml pipeline with dvc rust and python permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f2878d680f3bed6dd6fb5751e56ff333/39600/mr-data-psycho.png" alt="Framework Agnostic ML Pipeline with DVC, Rust and Python" title="Framework Agnostic ML Pipeline with DVC, Rust and Python" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<a href="https://www.linkedin.com/in/mr-data-psycho/" target="_blank" rel="nofollow noopener noreferrer"><strong>Sheikh Samsuzzhan Alam, aka Mr. Data Psycho</strong>,</a>
writes
<a href="https://towardsdev.com/ml-pipeline-decoupled-i-managed-to-write-a-framework-agnostic-ml-pipeline-with-dvc-rust-python-287de68104c9" target="_blank" rel="nofollow noopener noreferrer">this great piece</a>
that reminds us that DVC is language agnostic! While Python is the most popular
language used in Data Science and with DVC, there are some instances where you
may want to use a language such as Rust to improve memory efficiency and speed
up parts of your project. The good news is you can! Mr. Data
Psycho extols the virtues of DVC’s pipelining feature and shows how to use Rust
(Polars) as a pre-processing framework, with scikit-learn for model training and
the rest of the pipeline in Python. Using the YAML files, each stage can be put together using
dependencies written in whatever language your heart desires! You can find the
repo for the project <a href="https://github.com/DataPsycho/mlpipeline-with-dvc" target="_blank" rel="nofollow noopener noreferrer">here</a>.
R users may be interested in this related content
<a href="https://github.com/jcpsantiago/dvthis" target="_blank" rel="nofollow noopener noreferrer">here</a>,
<a href="https://www.youtube.com/watch?v=NwUijrm2U2w&t=2s" target="_blank" rel="nofollow noopener noreferrer">here,</a> and
<a href="https://iterative.ai/blog/r-code-and-reproducible-model-development-with-dvc" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
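<p>To make the idea a bit more concrete, here is a minimal sketch of what a
mixed-language <code>dvc.yaml</code> might look like. The stage names, file paths, and
commands below are hypothetical and are not taken from Mr. Data Psycho's repo;
they only illustrate that a stage's <code>cmd</code> is an ordinary shell command, so
one stage can call a Rust binary while the next calls a Python script.</p>

```yaml
# Hypothetical dvc.yaml: a Rust pre-processing stage feeding a Python training stage.
# All paths and names are illustrative assumptions, not from the original project.
stages:
  preprocess:
    # Any shell command works here; this one builds and runs a Rust crate.
    cmd: cargo run --release --manifest-path preprocess/Cargo.toml -- data/raw.csv data/clean.csv
    deps:
      - preprocess/src/main.rs
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    # The training stage is plain Python and consumes the Rust stage's output.
    cmd: python train.py data/clean.csv models/model.pkl
    deps:
      - train.py
      - data/clean.csv
    outs:
      - models/model.pkl
```

<p>With a layout like this, <code>dvc repro</code> would rerun only the stages whose
dependencies changed, regardless of which language each stage is written in.</p>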
<h3 id="digital-cheatsheet-for-dvc" style="position:relative;">Digital Cheatsheet for DVC<a href="#digital-cheatsheet-for-dvc" aria-label="digital cheatsheet for dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If you’d like an online cheat sheet for DVC, you can find one
<a href="https://cheat.sh/dvc" target="_blank" rel="nofollow noopener noreferrer">here</a> created by
<a href="https://twitter.com/igor_chubin" target="_blank" rel="nofollow noopener noreferrer"><strong>Igor Chubin</strong></a>. Pick a command from the
drop-down menu and bam 💥, you’ve got the info you need! It’s very cool, but do
always remember to check our docs <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">here</a>,
<a href="https://cml.dev/doc" target="_blank" rel="nofollow noopener noreferrer">here</a>, and <a href="https://mlem.ai/doc" target="_blank" rel="nofollow noopener noreferrer">here</a>; we are always
updating them!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/14e259393d78eccef563d67bece560f2/39600/cheatsheet.png" alt="DVC Cheat sheet" title="DVC Cheat sheet" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DVC
Cheat Sheet (<a href="https://cheat.sh/dvc" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h3 id="akvelon-enables-non-python-apps-to-integrate-machine-learning-models-with-mlem" style="position:relative;">Akvelon enables non-Python apps to integrate machine learning models with MLEM<a href="#akvelon-enables-non-python-apps-to-integrate-machine-learning-models-with-mlem" aria-label="akvelon enables non python apps to integrate machine learning models with mlem permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://www.linkedin.com/in/aleksandr-dudko-bb475476/" target="_blank" rel="nofollow noopener noreferrer"><strong>Aleksandr Dudko</strong></a>,
<a href="https://www.linkedin.com/in/anatolii-bolshakov-9a25b2199/" target="_blank" rel="nofollow noopener noreferrer"><strong>Anatoly Bolshakov</strong></a>,
<a href="https://www.linkedin.com/in/denis-nosov/" target="_blank" rel="nofollow noopener noreferrer"><strong>Denis Nosov</strong></a>, and
<a href="https://www.linkedin.com/in/vladimir-krestov-4873391ba/" target="_blank" rel="nofollow noopener noreferrer"><strong>Vladimir Krestov</strong></a>,
of <a href="https://akvelon.com/" target="_blank" rel="nofollow noopener noreferrer">Akvelon,</a> wrote
<a href="https://akvelon.com/akvelon-enables-non-python-apps-to-integrate-machine-learning-models-with-mlem/" target="_blank" rel="nofollow noopener noreferrer">this great tutorial</a>
on using MLEM to make integrating, packaging, and deploying machine learning
models much easier. In the tutorial, they show how to do this with Akvelon’s
.NET and Java clients in existing or new Web (ASP.NET, Java Spring), Mobile
(Xamarin, Android), and Desktop (WPF, WinForms, Java Spring) applications.
Explore the project directory
<a href="https://github.com/akvelon/MLEM-SDK-for-Java" target="_blank" rel="nofollow noopener noreferrer">here.</a></p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/61de316bc2b4b785a53f2d18a96b4009/39600/akvelon.png" alt="Akvelon enables non-Python apps to integrate machine learning models with MLEM" title="Akvelon enables non-Python apps to integrate machine learning models with MLEM" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Akvelon enables non-Python apps to integrate machine learning models with MLEM
(<a href="https://akvelon.com/akvelon-enables-non-python-apps-to-integrate-machine-learning-models-with-mlem/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h1 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><img src="https://media.giphy.com/media/LdBroIIcAdoj8NuG6Q/giphy.gif" alt="Awesome Thats Lit GIF by Samsung Austria"></p>
<h2 id="dvc-live-experiment-tracking" style="position:relative;">DVC Live Experiment Tracking<a href="#dvc-live-experiment-tracking" aria-label="dvc live experiment tracking permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We’ve been listening to the greater Community and know you’d like to see easier
experiment tracking from DVC and we’re on it!
<a href="https://iterative.ai/blog/exp-tracking-dvc-python?tab=DVC-extension-for-VS-Code" target="_blank" rel="nofollow noopener noreferrer">The latest release of DVCLive</a>
helps bring that goal to fruition. Now you can track your experiments with only
a couple of lines of code directly from your notebook or your .py file. You can
start with just a repo with Git and DVC initialized and your existing tools,
with no need for a hosted solution, a server, or a database. Keep all the
metadata related to the experiment in your Git provider of choice
(GitHub/GitLab) and in your cloud storage, and share it with your team when you
are ready. In addition, you can use Iterative Studio to share the results of
your experiments with teammates.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 400px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/5e118b266ce2ac73c087600246e067a6/03346/ariel-biller.jpg" alt="Ariel Biller Experiment Tracking meme" title="Ariel Biller Experiment Tracking meme" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Ariel Biller's Experiment Tracking meme
(<a href="https://twitter.com/untitled01ipynb/status/1593911944989270016?s=20&t=h0rvf7Bi7ikf9E3hna4vYw" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="new-unstructured-data-query-language" style="position:relative;">New Unstructured Data Query Language<a href="#new-unstructured-data-query-language" aria-label="new unstructured data query language permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Do you use Amazon S3, Azure Blob Storage, or Google Cloud Storage? We have a new
solution for finding and managing your datasets of unstructured data like
images, audio files, and PDFs! Extend your DVC environment with the first
unstructured data query language (think SQL -> DQL) for machine learning. We are
looking for beta customers for this new tool.</p>
<p><a href="https://calendly.com/gtm-2/iterative-datamgmt-overview" target="_blank" rel="nofollow noopener noreferrer">Schedule a meeting with us</a>
if that's what you need!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/8e61fa8c2db431382c5a89161f23db10/39600/dvc-cloud.png" alt="Unstructured Data Query Language from the makers of DVC" title="Unstructured Data Query Language from the makers of DVC" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Unstructured Data Query Language Prototype</em></p>
<h2 id="gto-tutorial-on-the-blog" style="position:relative;">GTO Tutorial on the Blog<a href="#gto-tutorial-on-the-blog" aria-label="gto tutorial on the blog permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>A model registry is a tool to catalog ML models and their versions. Models from
your data science projects can be discovered, tested, shared, deployed, and
audited from there. Learn how to build a model registry in a DVC Git repo
without involving any extra services, integrations, or APIs in
<a href="https://iterative.ai/blog/gto-model-registry" target="_blank" rel="nofollow noopener noreferrer">this new post</a> from
<a href="https://www.linkedin.com/in/1aguschin/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Guschin</strong></a>!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bcd104768a32e723be669c72e5520ba0/03346/drawing-owl-step-by-step.jpg" alt="Building a GitOps ML Model Registry with DVC and GTO" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>On January 11th,
<a href="https://www.linkedin.com/in/francescocalcavecchia/" target="_blank" rel="nofollow noopener noreferrer"><strong>Francesco Calcavecchia</strong></a>
will join us to talk about his recent contribution to MLEM through his work on
GTO, and how it helps him create a Git-based model registry at
<a href="https://www.eon.de/de/pk.html" target="_blank" rel="nofollow noopener noreferrer">E.On Energie Deutschland</a>.</p>
<section class="elp-content-holder">
<a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/events/289772002/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Francesco Calcavecchia on Designing a model Registry with Legacy Systems
using DVC and GTO</h4>
<div class="elp-description">Join us on January 11th. Designing a Model
Registry with Legacy Systems using GTO!</div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2022-12-16/meetup-4b11eb06e8fc8da7fcb2fe756fabd127.png" alt="Francesco Calcavecchia on Designing a model Registry with Legacy Systems
using DVC and GTO">
</div>
</a>
</section>
<h2 id="flappy-deevee" style="position:relative;">Flappy DeeVee<a href="#flappy-deevee" aria-label="flappy deevee permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Our global, all-remote team works hard, but we also have fun! We have a weekly
All-Hands meeting where our teams report progress via pre-recorded video so that
everyone can be prepared to discuss the topic during the meeting.</p>
<p>As we all level up our video production skills, the videos have started to get
more fun!
<a href="https://www.linkedin.com/in/jesper-svendsen-10892b1bb/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jesper Svendsen</strong></a>
inserted this FlappyDeeVee video in the middle of our Iterative Studio update!
Try the game <a href="https://flappycreator.com/flappy.php?id=638f6f7f1e9c8" target="_blank" rel="nofollow noopener noreferrer">here!</a>
Confession: I can’t get past the first pipe! 😆</p>
<p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-12-16/FlappyDeeVee-8bb7e63a03292475db9db6842b23780c.mp4" type="video/mp4"> Your
browser does not support the video tag. </video></p>
<p>Stay tuned to
<a href="https://iterative.ai/#:~:text=Go%20to%20Twitter-,Subscribe,-for%20updates.%20We" target="_blank" rel="nofollow noopener noreferrer">our Newsletter </a>
for more content from the Community and what we will be up to conference-wise in
2023!</p>
<h2 id="-doc-updates" style="position:relative;">✍🏼 Doc Updates!<a href="#-doc-updates" aria-label=" doc updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a> team recently updated their commands to make
them more intuitive. If you are used to the old ones, do not fret: when you use
an old command, the CLI reminds you of its new name. In the meantime, you can
get up to date on the changes
<a href="https://cml.dev/doc/ref" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<h3 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Our
<a href="https://iterative.ai/blog/jupyter-notebook-dvc-pipeline" target="_blank" rel="nofollow noopener noreferrer">Notebooks to DVC Pipeline for Reproducible Experiments</a>
post by
<a href="https://www.linkedin.com/in/rcdewit?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAAA5CEPkB9fI02IpClBKhRdq2brULPHMhmR8&lipi=urn%3Ali%3Apage%3Ad_flagship3_search_srp_all%3BaKm1eO7JQle9sN63j%2FHHFA%3D%3D" target="_blank" rel="nofollow noopener noreferrer"><strong>Rob de Wit</strong></a>
was featured in <a href="https://twitter.com/dl_weekly" target="_blank" rel="nofollow noopener noreferrer">Deep Learning Weekly.</a></p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">🤖 Issue #276 is now live! This week in deep learning: AI with the right dose of curiosity, notebooks to DVC pipelines for reproducible experiments, generating human-level text with contrastive search, an open-source data exploration tool, and more.<a href="https://t.co/JXUkrOEYzC">https://t.co/JXUkrOEYzC</a></p>— Deep Learning Weekly (@dl_weekly) <a href="https://twitter.com/dl_weekly/status/1592900833741393920">November 16, 2022</a></blockquote>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/exp-tracking-dvc-pythonhttps://dvc.org/blog/exp-tracking-dvc-pythonThu, 15 Dec 2022 00:00:00 GMT<p>Did you know that DVC can track experiments? Now you can track experiments in
DVC by changing a few lines of your Python code.</p>
<p>And with the optional <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC extension for VS Code</a>, you have a full-fledged
experiment tracking interface in your IDE!</p>
<toggle>
<tab title="DVC extension for VS Code">
<p><video controlslist="nodownload" preload="metadata" muted controls style="width:100%;"><source src="/2022-12-15/dvclive_exp_tracking-42c4f5a2c17a7b093355745095508589.mp4" type="video/mp4">
Your browser does not support the video tag. </video></p>
</tab>
<tab title="Notebook">
<p><video controlslist="nodownload" preload="metadata" muted controls style="width:100%;"><source src="/2022-12-15/dvclive_exp_tracking_cli-ec65b9aeb72b88e9e31a4c80af52f5b3.mp4" type="video/mp4">
Your browser does not support the video tag. </video></p>
</tab>
</toggle>
<h1 id="why" style="position:relative;">Why?<a href="#why" aria-label="why permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>We want to bring the DVC ethos to experiment tracking, but the learning curve
for DVC can be steep. That's why we built our Python logging library <a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a>
to make it easy to start.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 430px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/21e2e2944c1c1b11883c74e0932f31b5/39600/another_exp_tracker.png" alt="another exp tracker" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>source:
<a href="https://twitter.com/untitled01ipynb/status/1593911944989270016" target="_blank" rel="nofollow noopener noreferrer">https://twitter.com/untitled01ipynb/status/1593911944989270016</a></em></p>
<p>All you need to start is a Git repo. There are no logins, servers, databases, or
UI to spin up. Every experiment run is saved in a Git commit, but those commits
are hidden so they don't clutter your repo, unlike saving each run to a separate
directory, or creating a Git branch for each.</p>
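<p>One way to picture "hidden" commits: DVC keeps experiment commits under a custom Git ref namespace (<code>refs/exps/</code>) instead of under <code>refs/heads/</code>, so they never appear as branches. The sketch below relies only on that namespace. The sample ref names and the helper <code>split_refs</code> are hypothetical, chosen for illustration; it sorts the kind of lines <code>git for-each-ref --format='%(refname)'</code> prints into branches and experiments.</p>

```python
# Illustrative sketch: separate DVC experiment refs from regular branches.
# Assumption: experiments live under the "refs/exps/" namespace.
EXP_PREFIX = "refs/exps/"
BRANCH_PREFIX = "refs/heads/"


def split_refs(ref_names):
    """Return (branches, experiments) from an iterable of full Git ref names."""
    branches, experiments = [], []
    for ref in ref_names:
        if ref.startswith(EXP_PREFIX):
            # Keep only the trailing path component, i.e. the experiment name.
            experiments.append(ref.rsplit("/", 1)[-1])
        elif ref.startswith(BRANCH_PREFIX):
            branches.append(ref[len(BRANCH_PREFIX):])
    return branches, experiments


# Hypothetical output of: git for-each-ref --format='%(refname)'
sample_refs = [
    "refs/heads/main",
    "refs/exps/ab/cdef123/quare-foil",
    "refs/exps/ab/cdef123/bitty-tass",
    "refs/tags/v1.0.0",
]

branches, experiments = split_refs(sample_refs)
print(branches)
print(experiments)
```

<p>Only <code>main</code> shows up as a branch here; the two experiments stay out of the branch list, which is the point of keeping them in a separate namespace.</p>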
<p>From that simple starting point, DVC experiment tracking grows with your
project. You don't have to decide today whether you will need to share with your
team or backup to cloud storage. That's because DVC builds on top of the tools
you already use and allows you to incrementally integrate them.</p>
<p>When you need to
<a href="https://dvc.org/doc/user-guide/experiment-management/sharing-experiments" target="_blank" rel="nofollow noopener noreferrer">share</a>,
push existing experiments to your Git provider (GitHub/GitLab). When you need
artifact
<a href="https://dvc.org/doc/start/data-management/data-versioning#storing-and-sharing" target="_blank" rel="nofollow noopener noreferrer">storage</a>,
add your own cloud provider and push your existing artifacts. When you need a
UI, use VS Code or add <a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a> for a collaborative interface.</p>
<h1 id="how-to-start" style="position:relative;">How to start<a href="#how-to-start" aria-label="how to start permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Check out the example <a href="https://github.com/iterative/dvclive-exp-tracking" target="_blank" rel="nofollow noopener noreferrer">repo</a>, try it out in a <a href="https://colab.research.google.com/drive/1VKEBdSgFdEjg-k6FqNXX-0o83QWcpmN_?usp=sharing" target="_blank" rel="nofollow noopener noreferrer">colab notebook</a>, or follow the
steps below to start with your own model training code.</p>
<ol>
<li>
<p>Install DVC>=2.38.0 as a library in your Python environment.</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> <span class="token parameter variable">--upgrade</span> dvc</span></code></pre></div>
</li>
<li>
<p>Set up a DVC repo where your model training code is (or use an existing repo).</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token git">git init</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc init</span>
</span><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> <span class="token parameter variable">-A</span>
</span><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span> <span class="token parameter variable">-m</span> <span class="token string">"setup dvc repo"</span></span></code></pre></div>
</li>
<li>
<p>In your code, enable DVC experiment tracking using <a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a> with
<code>save_dvc_exp=True</code>. Use the callback for your framework or log your own
metrics. You can find examples below
(<a href="https://dvc.org/doc/dvclive/api-reference/ml-frameworks" target="_blank" rel="nofollow noopener noreferrer">other frameworks available</a>):</p>
</li>
</ol>
<toggle>
<tab title="PyTorch Lightning">
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>lightning <span class="token keyword">import</span> DVCLiveLogger
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
trainer <span class="token operator">=</span> Trainer<span class="token punctuation">(</span>logger<span class="token operator">=</span>DVCLiveLogger<span class="token punctuation">(</span>save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
trainer<span class="token punctuation">.</span>fit<span class="token punctuation">(</span>model<span class="token punctuation">)</span></code></pre></div>
</tab>
<tab title="Hugging Face">
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>huggingface <span class="token keyword">import</span> DVCLiveCallback
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
trainer<span class="token punctuation">.</span>add_callback<span class="token punctuation">(</span>DVCLiveCallback<span class="token punctuation">(</span>save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
trainer<span class="token punctuation">.</span>train<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div>
</tab>
<tab title="Keras">
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>keras <span class="token keyword">import</span> DVCLiveCallback
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
model<span class="token punctuation">.</span>fit<span class="token punctuation">(</span>
    train_dataset<span class="token punctuation">,</span> validation_data<span class="token operator">=</span>validation_dataset<span class="token punctuation">,</span>
    callbacks<span class="token operator">=</span><span class="token punctuation">[</span>DVCLiveCallback<span class="token punctuation">(</span>save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">)</span></code></pre></div>
</tab>
<tab title="General Python API">
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive <span class="token keyword">import</span> Live
<span class="token keyword">with</span> Live<span class="token punctuation">(</span>save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span> <span class="token keyword">as</span> live<span class="token punctuation">:</span>
    live<span class="token punctuation">.</span>log_param<span class="token punctuation">(</span><span class="token string">"epochs"</span><span class="token punctuation">,</span> NUM_EPOCHS<span class="token punctuation">)</span>
    <span class="token keyword">for</span> epoch <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span>NUM_EPOCHS<span class="token punctuation">)</span><span class="token punctuation">:</span>
        train_model<span class="token punctuation">(</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">)</span>
        metrics <span class="token operator">=</span> evaluate_model<span class="token punctuation">(</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">)</span>
        <span class="token keyword">for</span> metric_name<span class="token punctuation">,</span> value <span class="token keyword">in</span> metrics<span class="token punctuation">.</span>items<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
            live<span class="token punctuation">.</span>log_metric<span class="token punctuation">(</span>metric_name<span class="token punctuation">,</span> value<span class="token punctuation">)</span>
        live<span class="token punctuation">.</span>next_step<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div>
</tab>
</toggle>
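<p>The <code>train_model(...)</code> and <code>evaluate_model(...)</code> placeholders in the General Python API tab can be anything. As a minimal sketch, the loop below is wrapped in a function that accepts any object exposing <code>log_param</code>, <code>log_metric</code>, and <code>next_step</code> (the three <code>Live</code> methods used above), so it can be dry-run without DVC installed. <code>run_training</code>, <code>fake_evaluate</code>, and <code>RecordingLogger</code> are hypothetical names for illustration, and the "loss" is a made-up formula rather than a real model.</p>

```python
def fake_evaluate(epoch):
    """Stand-in for real evaluation: a made-up loss that shrinks each epoch."""
    return {"train_loss": 1.0 / (epoch + 1)}


def run_training(live, num_epochs):
    """Log one parameter, then one set of metrics per epoch."""
    live.log_param("epochs", num_epochs)
    for epoch in range(num_epochs):
        metrics = fake_evaluate(epoch)
        for metric_name, value in metrics.items():
            live.log_metric(metric_name, value)
        live.next_step()


class RecordingLogger:
    """Hypothetical in-memory logger for dry-running the loop without DVC."""

    def __init__(self):
        self.params, self.history, self.step = {}, [], 0

    def log_param(self, name, value):
        self.params[name] = value

    def log_metric(self, name, value):
        # Record the step alongside the metric, as a logger would.
        self.history.append((self.step, name, value))

    def next_step(self):
        self.step += 1


logger = RecordingLogger()
run_training(logger, num_epochs=3)
print(logger.params)
print(logger.history)
```

<p>With DVCLive installed, the same function should work unchanged when called as <code>run_training(live, num_epochs=3)</code> inside a <code>with Live(save_dvc_exp=True) as live:</code> block.</p>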
<p>4. Run your code and track the experiment results.</p>
<toggle>
<tab title="DVC extension for VS Code">
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/37aa508562076134553fffd26e1c8c4b/39600/dvclive_exp_tracking.png" alt="dvclive exp tracking" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
</tab>
<tab title="Command line">
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token comment"># Show the experiments table in the terminal.</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span>
</span> ────────────────────────────────────────────────────────────────────────────────────
  Experiment                 Created        train_loss   epoch   step   encoder_size
 ────────────────────────────────────────────────────────────────────────────────────
  workspace                  -              0.020196     4       500    512
  main                       Dec 06, 2022   -            -       -      -
  ├── c1759a5 [quare-foil]   08:55 PM       0.020196     4       500    512
  ├── affedee [bitty-tass]   08:55 PM       0.02038      4       500    256
  ├── a5bdc18 [murky-emeu]   08:55 PM       0.016396     4       500    128
  ├── 744f3b6 [sworn-wage]   08:54 PM       0.01972      4       500    64
  └── 0c3ac81 [named-gaby]   08:54 PM       0.031206     4       500    32
────────────────────────────────────────────────────────────────────────────────────
<span class="token comment"># Plot the diff of all experiments in an HTML file.</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> <span class="token variable"><span class="token variable">$(</span>dvc exp list --name-only<span class="token variable">)</span></span>
</span>file:///Users/dave/Code/dvclive-exp-tracking/dvc_plots/index.html</code></pre></div>
<p>Open the HTML to see the plots:</p>
<p><img src="https://dvc.org/2022-12-15/dvclive_exp_tracking_plots_diff-4da17e97756bf8f97e5ad63ca9f8ca3c.svg" alt="" title="=500"></p>
</tab>
</toggle>
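<p>Once the table grows past a handful of rows, it can help to rank experiments programmatically. The sketch below hard-codes the five runs from the table above (it does not call DVC) and picks the one with the lowest <code>train_loss</code>. <code>best_experiment</code> is a hypothetical helper name.</p>

```python
# Rows copied from the `dvc exp show` table above: (name, train_loss, encoder_size).
experiments = [
    ("quare-foil", 0.020196, 512),
    ("bitty-tass", 0.02038, 256),
    ("murky-emeu", 0.016396, 128),
    ("sworn-wage", 0.01972, 64),
    ("named-gaby", 0.031206, 32),
]


def best_experiment(rows):
    """Return the row with the smallest train_loss (lower is better)."""
    return min(rows, key=lambda row: row[1])


name, loss, encoder_size = best_experiment(experiments)
print(f"{name}: train_loss={loss}, encoder_size={encoder_size}")
```

<p>In this run the best loss comes from the 128-unit encoder, not the largest one.</p>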
<h1 id="stay-tuned" style="position:relative;">Stay tuned<a href="#stay-tuned" aria-label="stay tuned permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>That's all there is to it! There's lots more coming for DVC experiment tracking,
including:</p>
<ul>
<li>
<p><strong>Showing you where to go from here</strong>. Share your experiments, add data or
pipelines, and use DVC without ever leaving your notebook or Python IDE.</p>
</li>
<li>
<p><strong>Adding more DVCLive features</strong>. Share realtime updates to <a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">Iterative
Studio</a>, log data and model artifacts, and compare experiments in Python.</p>
</li>
</ul>
<p>Try out the <a href="https://github.com/iterative/dvclive-exp-tracking" target="_blank" rel="nofollow noopener noreferrer">repo</a> or <a href="https://colab.research.google.com/drive/1VKEBdSgFdEjg-k6FqNXX-0o83QWcpmN_?usp=sharing" target="_blank" rel="nofollow noopener noreferrer">colab notebook</a> and let us know what you think in
<a href="https://discordapp.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a> or
<a href="https://github.com/iterative/dvc/issues" target="_blank" rel="nofollow noopener noreferrer">GitHub</a>.</p>https://dvc.org/blog/gto-model-registryhttps://dvc.org/blog/gto-model-registryWed, 07 Dec 2022 00:00:00 GMT<p>Machine Learning is iterative in its nature. Similar to developing software,
you’re going to have many different versions of your models, improving them step
by step (such as <code>v0.1.0</code>, <code>v0.2.0</code>, etc.). To keep track of model development,
trigger checks and deployments, and know which versions are in production and
which are stuck in staging (both right now and retrospectively), ML specialists
organize models' lifecycles using Model Registries.</p>
<h2 id="the-pluses-and-minuses-of-model-registries" style="position:relative;">The Pluses and Minuses of Model Registries<a href="#the-pluses-and-minuses-of-model-registries" aria-label="the pluses and minuses of model registries permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>While model registries solve operational issues, many solutions come at a cost.
Model Registries often introduce a separate software stack that must be learned,
integrated with, and maintained. For example, if you keep your model training
code in Git, train your models with CI/CD, and use CI/CD to deploy them,
introducing a separate service in the middle of the process breaks the flow and
forces you to leave your code versioning ecosystem (Git + GitHub for example).
This happens when we add more and more systems and services that all try to be
the center of attention. A good example is working with MLflow or SageMaker as a
model registry - there’s a feeling it’s always “in the way” of the Git-based
development workflow.</p>
<h2 id="our-git-based-solution-to-model-registry" style="position:relative;">Our Git-based Solution to Model Registry<a href="#our-git-based-solution-to-model-registry" aria-label="our git based solution to model registry permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>To avoid that friction, we developed a CLI tool named
<a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a>. The tool is very simple: it organizes
a Model Registry in your Git repo using Git tags and a file called
<code>artifacts.yaml</code>. This short tutorial shows how to do just that, and
it's simpler than you might think.</p>
<p>Before we start, let’s take a look at
<a href="https://iterative.ai/model-registry" target="_blank" rel="nofollow noopener noreferrer"><strong>Studio Model Registry</strong></a>, which provides
a nice UI dashboard on top of GTO-managed registries:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/63d892c74b838ee3620d17e5dd877e95/39600/iterative-studio-model-registry.png" alt="Iterative Studio Model Registry" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
The model dashboard above has three models from a single Git repo (we’ll add
another one in a minute).
<a href="https://github.com/iterative/demo-bank-customer-churn/tags" target="_blank" rel="nofollow noopener noreferrer">Git tags</a> in this
repo represent the version registrations (such as <code>v2.0.1</code> or <code>v1.0.1</code>) and
stage assignments (like <code>dev</code>, <code>prod</code>, and <code>staging</code>) done by team members
(assigning <code>v1.0.0</code> to <code>dev</code> signals the version is ready to be deployed to the
<code>dev</code> environment and can trigger that deployment directly).</p>
<admon type="tip">
<p>Take a look around in our
<a href="https://studio.datachain.ai/team/Iterative/models" target="_blank" rel="nofollow noopener noreferrer">demo Model Registry</a> to get
a feel for Iterative Studio's Model Registry features.</p>
</admon>
<p>GTO provides a simpler view of the same registry from the CLI, which makes it
accessible from a terminal and friendly for developers:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token gto">gto show</span> <span class="token parameter variable">--repo</span> https://github.com/iterative/demo-bank-customer-churn
</span>╒════════════════════╤══════════╤════════╤═════════╤══════════╕
│ name │ latest │ <span class="token comment">#dev │ #prod │ #stage │</span>
╞════════════════════╪══════════╪════════╪═════════╪══════════╡
│ randomforest-model │ v2.0.0 │ v2.0.0 │ v1.0.0 │ - │
│ xgboost-model │ v1.0.1 │ - │ - │ v1.0.0 │
│ lightgbm-model │ v2.0.3 │ v2.0.3 │ v2.0.0 │ v2.0.0 │
╘════════════════════╧══════════╧════════╧═════════╧══════════╛</code></pre></div>
<p>Notice that GTO works with a single repo at a time - that’s why we need to
specify the <code>--repo</code> argument, while Studio aggregates your models from multiple
projects and repositories you add to it.</p>
<p>For this tutorial, we'll pick a simple project with no models registered yet, to
demonstrate adding a model registry on top of an existing ML project. We'll take
<a href="https://github.com/iterative/example-get-started" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/example-get-started</a>, which is an example DVC
project. We won’t get into details about DVC, but if you’re new to it, you can
check out <a href="https://dvc.org/doc/start" target="_blank" rel="nofollow noopener noreferrer">DVC Get Started</a>. Revisit the example
project before we start to get a quick picture of it if you wish.</p>
<p>The project trains a natural language processing (NLP) binary classifier
predicting tags for a given StackOverflow question. It uses DVC Pipelines to
connect raw text preprocessing and model training, producing an ML model stored
in the <code>model.pkl</code>. The <code>main</code> branch has a model version we can consider as the
first version, while the branch <code>try-large-dataset</code> is a promising experiment
that we’d like to mark as the second version and assign to the <code>dev</code> stage to
trigger a deployment.</p>
<p>To start, we need to
<a href="https://github.com/iterative/example-get-started/fork" target="_blank" rel="nofollow noopener noreferrer">fork the repo</a>, since
we’re going to make some changes to it. Note that you need to uncheck "Copy the
<code>main</code> branch only" because we'll be using the <code>try-large-dataset</code> branch as
well:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f4725bb8104a35916038304c9aac6e22/39600/fork-uncheck.png" alt="fork" title="fork" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>To use GTO from CLI, we'll set up a Python virtual environment:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">python</span> <span class="token parameter variable">-m</span> venv .venv
</span><span class="token line"><span class="token input">$ </span><span class="token command">source</span> .venv/bin/activate
</span><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> gto</span></code></pre></div>
<p>To remove some friction, we won’t clone the repo locally. This saves us from
running <code>commit</code> and <code>push</code> to update the remote repo, since GTO will do that
for us.</p>
<h2 id="registering-a-model-version" style="position:relative;">Registering a model version<a href="#registering-a-model-version" aria-label="registering a model version permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In the repo, we have an already trained ML model saved as <code>model.pkl</code>. The file
itself resides in an AWS S3 bucket and is tracked with DVC. One of the versions
of that model can be found in the HEAD of the <code>main</code> branch. Let’s register the
very first version of it - <a href="https://semver.org/" target="_blank" rel="nofollow noopener noreferrer"><code>v0.0.1</code></a>. Since we’ll be using
our remote repo many times here, we'll set a shell var <code>$REPO</code> to store the URL.</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">REPO</span><span class="token operator">=</span>https://github.com/<span class="token punctuation">{</span>user<span class="token punctuation">}</span>/example-get-started
</span><span class="token line"><span class="token input">$ </span><span class="token gto">gto register</span> classifier <span class="token parameter variable">--repo</span> <span class="token variable">$REPO</span>
</span>Created git tag '[email protected]' that registers a version
Running `git push origin [email protected]`
Successfully pushed git tag [email protected] on remote.</code></pre></div>
<p>Now the model is called <code>classifier</code> in our registry and the <code>v0.0.1</code> version is
registered in the tip of the <code>main</code> branch.</p>
<p>Since the repo we're working with is a remote one, GTO pushes a tag to the repo
automatically. With a local repo, you will need to run <code>git push</code> yourself
(although you can make GTO do that by providing a <code>--push</code> argument). This
workflow should be familiar to DVC and Git users - making changes locally and
then pushing them to remote with an additional command.</p>
<p>Now we can see the model dashboard of our registry:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token gto">gto show</span> <span class="token parameter variable">--repo</span> <span class="token variable">$REPO</span>
</span>╒════════════╤══════════╕
│ name │ latest │
╞════════════╪══════════╡
│ classifier │ v0.0.1 │
╘════════════╧══════════╛</code></pre></div>
<p>Remember that we only see a single <code>classifier</code> model because GTO works with a
single repo, and the models we’ve seen above were from another repository (notice
the <code>--repo</code> argument).</p>
<p>A common case is to use a model registry as a source of truth to pull models for
experimentation locally or in CI for deployments. Note that for now we manually
provide the path to the model (<code>model.pkl</code>) and Git revision to use
(<code>classifier@v0.0.1</code>). We’ll learn how to dynamically set them up using GTO in
the next sections.</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> <span class="token variable">$REPO</span> model.pkl <span class="token parameter variable">--rev</span> classifier@v0.0.1 <span class="token parameter variable">-o</span> model.pkl</span></code></pre></div>
<h2 id="adding-optional-model-metadata" style="position:relative;">Adding optional model metadata<a href="#adding-optional-model-metadata" aria-label="adding optional model metadata permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>To avoid hardcoding the model path in our scripts or keeping the model
description somewhere in a notebook, we need to store metadata about the model
in the repo itself. Unlike the Git tag we created to register a version, GTO
stores metadata in a file, which requires us to create a commit. This allows us
to have different paths or descriptions in different commits and branches, which
can be useful if you update your model significantly or change the structure
of your repo. Since the model is not annotated right now, let’s add that
information to the new model version in the <code>try-large-dataset</code> branch that
<a href="https://studio.datachain.ai/team/Iterative/projects/example-get-started-zde16i6c4g" target="_blank" rel="nofollow noopener noreferrer">increased ROC AUC of the model</a>.
Later we can merge this to <code>main</code> to update the annotation there:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token gto">gto annotate</span> classifier <span class="token punctuation">\</span>
<span class="token parameter variable">--repo</span> <span class="token variable">$REPO</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--rev</span> try-large-dataset <span class="token punctuation">\</span>
<span class="token parameter variable">--path</span> model.pkl <span class="token punctuation">\</span>
<span class="token parameter variable">--description</span> <span class="token string">"Simple text classification model"</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--type</span> model</span>
Updated `artifacts.yaml`
Running `git commit` and `git push`
Successfully pushed a new commit to remote.</code></pre></div>
<p>This creates an <code>artifacts.yaml</code> file with the following contents in the
<code>try-large-dataset</code> branch:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">classifier</span><span class="token punctuation">:</span>
<span class="token key atrule">path</span><span class="token punctuation">:</span> model.pkl
<span class="token key atrule">description</span><span class="token punctuation">:</span> Simple text classification model
<span class="token key atrule">type</span><span class="token punctuation">:</span> model</code></pre></div>
<h2 id="registering-another-version" style="position:relative;">Registering another version<a href="#registering-another-version" aria-label="registering another version permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>GTO allows you to build any kind of registry, including a dataset registry, a
model registry, or a mix of both. To distinguish between different artifact
types (e.g. a <code>dataset</code> and a <code>model</code>), it’s good to specify <code>type</code> while
annotating, as we did above. This also hints to Studio that <code>classifier</code> is a
<code>model</code>, so Studio can display it in Studio Model Registry.</p>
<p>Let’s register a new version in the <code>try-large-dataset</code> branch:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token gto">gto register</span> classifier <span class="token punctuation">\</span>
<span class="token parameter variable">--repo</span> <span class="token variable">$REPO</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--rev</span> try-large-dataset
</span>Created git tag 'classifier@v0.0.2' that registers version
Running `git push origin classifier@v0.0.2`
Successfully pushed git tag classifier@v0.0.2 on remote.</code></pre></div>
<p>Checking the updated model dashboard:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token gto">gto show</span> <span class="token parameter variable">--repo</span> <span class="token variable">$REPO</span>
</span>╒════════════╤══════════╕
│ name │ latest │
╞════════════╪══════════╡
│ classifier │ v0.0.2 │
╘════════════╧══════════╛</code></pre></div>
<p>The latest version of <code>classifier</code> is now <code>v0.0.2</code>.</p>
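<p>Notice that we never passed a version number: the second registration came out
as <code>v0.0.2</code> right after <code>v0.0.1</code>. As a minimal sketch of that kind of patch
increment (the <code>bump_patch</code> helper is hypothetical and is not GTO's own
implementation, which may differ), the arithmetic looks like this:</p>

```shell
# Hypothetical helper, not part of GTO: bump the PATCH component of a
# vMAJOR.MINOR.PATCH version string.
bump_patch() {
  local version major_minor patch
  version="${1#v}"              # strip the leading "v"
  major_minor="${version%.*}"   # everything before the last dot
  patch="${version##*.}"        # everything after the last dot
  echo "v${major_minor}.$((patch + 1))"
}

bump_patch v0.0.1
```

<p>If you prefer to control the version number yourself, check the GTO docs for
the options <code>gto register</code> accepts.</p>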
<p>To download the model and use it locally, we can now let GTO resolve the path
from the value stored in <code>artifacts.yaml</code> and then download the file using
DVC:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">REVISION</span><span class="token operator">=</span>classifier@v0.0.2
</span><span class="token line"><span class="token input">$ </span><span class="token command">MODEL_PATH</span><span class="token operator">=</span><span class="token variable"><span class="token variable">$(</span>gto describe classifier <span class="token parameter variable">--repo</span> $REPO <span class="token parameter variable">--rev</span> $REVISION <span class="token parameter variable">--path</span><span class="token variable">)</span></span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> <span class="token variable">$REPO</span> <span class="token variable">$MODEL_PATH</span> <span class="token parameter variable">--rev</span> <span class="token variable">$REVISION</span> <span class="token parameter variable">-o</span> <span class="token variable">$MODEL_PATH</span></span></code></pre></div>
<h2 id="assigning-stages-to-deploy-a-model" style="position:relative;">Assigning stages to deploy a model<a href="#assigning-stages-to-deploy-a-model" aria-label="assigning stages to deploy a model permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Now, we have two registered versions of our model: <code>v0.0.1</code> and <code>v0.0.2</code>. How do
we get one of them into production? To signal the model version is ready to be
used in some environment, we can assign it to a stage:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token gto">gto assign</span> classifier <span class="token punctuation">\</span>
<span class="token parameter variable">--repo</span> <span class="token variable">$REPO</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--version</span> v0.0.2 <span class="token punctuation">\</span>
<span class="token parameter variable">--stage</span> dev
</span>Created Git tag 'classifier#dev#1' that assigns stage
Running `git push origin classifier#dev#1`
Successfully pushed git tag classifier#dev#1 on remote.</code></pre></div>
<p>To actually start the deployment process, we'll need to set up a CI/CD that can
be triggered by pushing a Git tag. We'll discuss this in the next section.</p>
<p>Now the model dashboard will be updated with the newly assigned <code>dev</code> stage:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token gto">gto show</span> <span class="token parameter variable">--repo</span> <span class="token variable">$REPO</span>
</span>╒════════════╤══════════╤════════╕
│ name │ latest │ <span class="token comment">#dev │</span>
╞════════════╪══════════╪════════╡
│ classifier │ v0.0.2 │ v0.0.2 │
╘════════════╧══════════╧════════╛</code></pre></div>
<p>When running <a href="https://dvc.org/doc/gto/command-reference/show"><code>gto show</code></a> for a specific model, we get all of its registered
versions. Notice that the stage is shown next to the latest version assigned to
it, which signals the model version currently deployed in that stage:</p>
<div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">$ gto show classifier --repo $REPO
╒════════════╤═══════════╤═══════════╤═══════════════════╕
│ artifact │ version │ stage │ ref │
╞════════════╪═══════════╪═══════════╪═══════════════════╡
│ classifier │ v0.0.2 │ dev │ classifier@v0.0.2 │
│ classifier │ v0.0.1 │ │ classifier@v0.0.1 │
╘════════════╧═══════════╧═══════════╧═══════════════════╛</code></pre></div>
<p>With dozens of models, it’s easier to automate figuring out which versions are
currently assigned to stages. For that, we can use a variation of the <code>show</code>
command. To download the <code>classifier</code> version in <code>dev</code>:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">REVISION</span><span class="token operator">=</span><span class="token variable"><span class="token variable">$(</span>gto show classifier<span class="token comment">#dev --repo $REPO --ref</span><span class="token variable">)</span></span>
</span><span class="token line"><span class="token input">$ </span><span class="token command">MODEL_PATH</span><span class="token operator">=</span><span class="token variable"><span class="token variable">$(</span>gto describe classifier <span class="token parameter variable">--repo</span> $REPO <span class="token parameter variable">--rev</span> $REVISION <span class="token parameter variable">--path</span><span class="token variable">)</span></span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> <span class="token variable">$REPO</span> <span class="token variable">$MODEL_PATH</span> <span class="token parameter variable">--rev</span> <span class="token variable">$REVISION</span> <span class="token parameter variable">-o</span> <span class="token variable">$MODEL_PATH</span></span></code></pre></div>
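<p>If you repeat this for many models and stages, the three steps can be wrapped
in a small shell function. This is only a sketch: <code>fetch_stage_model</code> is a
hypothetical name, and it assumes <code>gto</code> and <code>dvc</code> are installed and that the
artifact is annotated with a path at the resolved revision:</p>

```shell
# Hypothetical helper: download the artifact version currently assigned to a stage.
# Usage: fetch_stage_model <repo-url> <artifact-name> <stage>
fetch_stage_model() {
  local repo name stage revision model_path
  repo="$1"; name="$2"; stage="$3"

  # Git tag of the version currently assigned to the stage
  revision="$(gto show "${name}#${stage}" --repo "$repo" --ref)" || return 1
  # Path recorded in artifacts.yaml at that revision
  model_path="$(gto describe "$name" --repo "$repo" --rev "$revision" --path)" || return 1

  if [ -z "$revision" ] || [ -z "$model_path" ]; then
    echo "could not resolve ${name}#${stage} in ${repo}" >&2
    return 1
  fi

  # Fetch the DVC-tracked file at that revision
  dvc get "$repo" "$model_path" --rev "$revision" -o "$model_path"
}

# Example call (needs network access and a repo with a stage assignment):
# fetch_stage_model "$REPO" classifier dev
```

<p>Quoting the arguments keeps the <code>#</code> in <code>classifier#dev</code> from ever being
treated as the start of a comment, and the empty-value check stops the script
before <code>dvc get</code> runs with a missing path or revision.</p>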
<h2 id="starting-cicd-for-new-versions-and-assignments" style="position:relative;">Starting CI/CD for new versions and assignments<a href="#starting-cicd-for-new-versions-and-assignments" aria-label="starting cicd for new versions and assignments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>CI/CD is a common way to set up some automation - including building your models
into Docker images or deploying them to Kubernetes or SageMaker. Since new
versions and stage assignments are implemented using Git tags, they can
automatically kick off a CI/CD process that you can set up with
<a href="https://docs.github.com/en/actions" target="_blank" rel="nofollow noopener noreferrer">GitHub Actions</a> or any other CI/CD tool,
allowing you to programmatically react with actions you would like to perform.</p>
<p>Showing a full CI/CD example is worthy of a dedicated blog post, so we’ll save
it for another time. If you want to see how it works, there are two examples in
the <a href="https://github.com/iterative/example-gto/actions" target="_blank" rel="nofollow noopener noreferrer">GTO example repo</a>. The
one in the <code>main</code> branch
<a href="https://github.com/iterative/example-gto/blob/main/.github/workflows/gto-act-on-tags.yml" target="_blank" rel="nofollow noopener noreferrer">shows how to parse a Git tag</a>
to react to new versions and stage assignments differently, while the other in
the <code>mlem</code> branch explains
<a href="https://github.com/iterative/example-gto/blob/mlem/.github/workflows/deploy-model-with-mlem.yml" target="_blank" rel="nofollow noopener noreferrer">how to deploy your model in a single line</a>
with
<a href="https://github.com/iterative/example-gto/blob/mlem/.github/workflows/deploy-model-with-mlem.yml" target="_blank" rel="nofollow noopener noreferrer">MLEM</a>.</p>
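<p>To give a rough idea of what such tag parsing can look like, here is a minimal
sketch. It assumes tags follow the <code>name@vX.Y.Z</code> and <code>name#stage#N</code> patterns
seen in the command output earlier in this post; <code>parse_gto_tag</code> is a
hypothetical helper and is not taken from the example repo's workflow:</p>

```shell
# Hypothetical helper: classify a GTO-style Git tag name.
#   name@vX.Y.Z   -> prints "register <name> <version>"
#   name#stage#N  -> prints "assign <name> <stage>"
parse_gto_tag() {
  local tag name rest
  tag="$1"
  case "$tag" in
    *@v*)
      # Split on the first "@": artifact name, then version
      echo "register ${tag%%@*} ${tag#*@}"
      ;;
    *#*#*)
      name="${tag%%#*}"   # text before the first "#"
      rest="${tag#*#}"    # "stage#N"
      echo "assign $name ${rest%%#*}"
      ;;
    *)
      echo "unknown $tag"
      ;;
  esac
}

# In GitHub Actions, the tag that triggered the run is available in
# $GITHUB_REF_NAME; a sample tag is used here when that variable is unset.
parse_gto_tag "${GITHUB_REF_NAME:-classifier#dev#1}"
```

<p>A workflow step could branch on the first word of the output, for example
building an image on <code>register</code> and deploying on <code>assign</code>.</p>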
<h2 id="taking-a-high-level-look-at-our-model-registry" style="position:relative;">Taking a high-level look at our Model Registry<a href="#taking-a-high-level-look-at-our-model-registry" aria-label="taking a high level look at our model registry permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We just learned how to register semantic model versions, assign stages to them,
and employ CI/CD to act on those, all using a GitOps approach. Used together
with DVC, this allows us to accomplish the main use cases for a powerful model
registry, while not introducing any extra services and staying inside a Git
Repo.</p>
<p>As we saw above, GTO works within a single repo and requires you to work in the CLI.
To lift these limitations, we introduced Iterative Studio Model Registry which,
in a nutshell, is a friendly UI that allows you to work with GTO artifacts
gathered from multiple repositories. This is what
<a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">Studio Model Registry</a> will look like if you log
in and add the repo:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/84f17464955e2d4b4d341bf40212d5e3/39600/iterative-studio-model-registry-2.png" alt="Iterative Studio Model Registry" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Besides the <code>classifier</code> model that we just registered, you can also see three
other models from our example <code>demo-bank-customer-churn</code> repo.</p>
<p>Behind the scenes,
<a href="https://dvc.org/doc/studio" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio just uses the GTO API</a>, so there are
no new magic tricks here (and you can also use the GTO API from your automation
Python code if you wish). Feel free to play around to register more versions,
assign stages or annotate the other models you have, and see how Studio can help
you track model lineage, audit events, and connect model versions to DVC
experiments.</p>
<h2 id="whats-next" style="position:relative;">What’s next?<a href="#whats-next" aria-label="whats next permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Check out <a href="https://mlem.ai/doc/gto/" target="_blank" rel="nofollow noopener noreferrer">GTO docs</a> to learn more about the tool and
ask us questions in <a href="https://discord.com/channels/485586884165107732/903647230655881226" target="_blank" rel="nofollow noopener noreferrer">Discord</a> - we’re happy to help you!</p>
<p>Take a look at our
<a href="https://studio.datachain.ai/team/Iterative/models" target="_blank" rel="nofollow noopener noreferrer">public Model Registry</a> so
you can see for yourself how Iterative Studio puts together a Git-based Model
Registry experience.</p>
<p>Share your feedback in <a href="https://discord.com/channels/485586884165107732/903647230655881226" target="_blank" rel="nofollow noopener noreferrer">Discord</a> or
<a href="https://github.com/iterative/gto/issues" target="_blank" rel="nofollow noopener noreferrer">GitHub issues</a> to help us build an
open-source Model Registry on top of Git, so you can stick to an existing
software engineering stack. No more divide between ML engineering and
operations!</p>https://dvc.org/blog/november-2022-heartbeathttps://dvc.org/blog/november-2022-heartbeatFri, 18 Nov 2022 00:00:00 GMT<p>Welcome to November! In the US, this is the time of year we reflect and give
thanks. It's been a productive year despite the world's rather extreme
challenges. There's lots to be thankful for. Here are some of those things from
the last month in the Iterative Community.</p>
<h1 id="ai-news" style="position:relative;">AI News<a href="#ai-news" aria-label="ai news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h3 id="robert-toews---the-biggest-opportunity-in-generative-ai-is-language-not-images" style="position:relative;">Robert Toews - The Biggest Opportunity in Generative AI Is Language, Not Images<a href="#robert-toews---the-biggest-opportunity-in-generative-ai-is-language-not-images" aria-label="robert toews the biggest opportunity in generative ai is language not images permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 200px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/34a929a6ca9a22a5520ff7aa9b90ef39/03346/forbes.jpg" alt="NLP" title="Rob Toews bets on language over images" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<a href="https://www.forbes.com/sites/robtoews/2022/11/06/the-biggest-opportunity-in-generative-ai-is-language-not-images/?sh=303a5719789d" target="_blank" rel="nofollow noopener noreferrer">In this article</a>
entitled <em>The Biggest Opportunity In Generative AI Is Language, Not Images</em>,
<a href="https://www.linkedin.com/in/robtoews/" target="_blank" rel="nofollow noopener noreferrer"><strong>Robert Toews</strong></a> argues that AI-powered
text generation will create many orders of magnitude more value than
text-generated images.</p>
<blockquote>
<p>Language is humanity’s single most important invention. More than anything
else, it is what sets us apart from every other species on the planet.
Language enables us to reason abstractly, to develop complex ideas about what
the world is and could be, to communicate these ideas to one another, and to
build on them across generations and geographies. Almost nothing about modern
civilization would be possible without language.</p>
</blockquote>
<p>He points out the many examples from a variety of industries and academia that
have gained and will continue to gain massive improvements due to the power of
large language models (LLMs) in the coming years. Read the article for all the
applications.</p>
<h3 id="state-of-ai-report" style="position:relative;">State of AI Report<a href="#state-of-ai-report" aria-label="state of ai report permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The
<a href="https://docs.google.com/presentation/d/1WrkeJ9-CjuotTXoa4ZZlB3UPBXpxe4B3FMs9R9tn34I/edit#slide=id.g164b1bac824_0_2794" target="_blank" rel="nofollow noopener noreferrer">State of AI Report</a>
is generated each year and reports on the most interesting things the authors,
<a href="https://twitter.com/nathanbenaich" target="_blank" rel="nofollow noopener noreferrer"><strong>Nathan Benaich</strong></a>,
<a href="https://twitter.com/soundboy" target="_blank" rel="nofollow noopener noreferrer"><strong>Ian Hogarth</strong></a>,
<a href="https://twitter.com/osebbouh" target="_blank" rel="nofollow noopener noreferrer"><strong>Othmane Sebbouh</strong></a>, and
<a href="https://twitter.com/nitarshan" target="_blank" rel="nofollow noopener noreferrer"><strong>Nitarshan Rajkumar</strong></a> come across in the world
of AI throughout the year.</p>
<ul>
<li>Slide 22: Mirroring the ideas of the Toews article above, this slide discusses
the LLM use case of conversational code generation. OpenAI's Codex, which
powers <a href="https://github.com/features/copilot" target="_blank" rel="nofollow noopener noreferrer">GitHub's Copilot</a> to produce this
capability, was on display at the recent
<a href="https://watch.githubuniverse.com/home" target="_blank" rel="nofollow noopener noreferrer">GitHub Universe</a>. Other companies,
including Salesforce, Google, and DeepMind, are working on code-generating
projects of their own, with Google's LLM PaLM coming out as a favored option,
having been trained on 50x less code than Codex. Alternatively, DeepMind's
AlphaCode generates the whole program as opposed to lines of code.</li>
<li>Slide 24: Continuing to echo Toews' article, LLMs in research are greatly
improving their mathematical abilities, jumping to far better scores than
previous model versions. Techniques that helped to achieve these gains are
discussed.</li>
<li>Slides 30 and 31: Challenging Toews' stance, these slides show the great
progress in Computer Vision. Diffusion models are doing more than just
text-to-image generation. Now they are being used for text-to-video, text
generation, audio, molecular design, and more. Info on the techniques now
being used can be found in Slide 30. Slide 31 discusses the huge improvement in
the next generation of text-to-image generation competing models including
DALL-E, Imagen, and Parti.</li>
</ul>
<p>Be sure to digest the whole report for even more AI advances!</p>
<p>💓 So for our “Pulse check” this month:</p>
<admon type="tip">
<p>Do you agree that NLP will have more impact than computer vision? Tell us about
what you are working on with NLP. We’d love to get you connected with others
struggling with similar issues and know how we can improve our tools to help you
with your NLP projects.</p>
</admon>
<p>Join us in the <code>#general</code> channel in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a> to weigh in.</p>
<h1 id="community-content-highlights" style="position:relative;">Community Content Highlights<a href="#community-content-highlights" aria-label="community content highlights permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="thank-you-hacktoberfest-contributors" style="position:relative;">Thank you Hacktoberfest Contributors!<a href="#thank-you-hacktoberfest-contributors" aria-label="thank you hacktoberfest contributors permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We would like to thank
<a href="https://github.com/francesco086" target="_blank" rel="nofollow noopener noreferrer"><strong>Francesco Calcavecchia</strong></a>,
<a href="https://github.com/vvssttkk" target="_blank" rel="nofollow noopener noreferrer"><strong>vvssttkk</strong></a>, and
<a href="https://github.com/deepyaman" target="_blank" rel="nofollow noopener noreferrer"><strong>deepyaman</strong></a> for their contributions to
<a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a>,
<a href="https://github.com/iterative/mlem" target="_blank" rel="nofollow noopener noreferrer">MLEM,</a> and
<a href="https://github.com/iterative/cml" target="_blank" rel="nofollow noopener noreferrer">CML</a> respectively. They will be receiving
their own personalized shirts that note their contributions! And many thanks to
<a href="https://www.linkedin.com/in/mertbozkir/" target="_blank" rel="nofollow noopener noreferrer"><strong>Mert Bozkir</strong></a> for leading the
Hacktoberfest charge here at Iterative!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e80ba8968ec0e28cc7bcd1e8eb624382/39600/hacktoberfest.png" alt="Hacktoberfest Contributors" title="Hacktoberfest Contributors" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>2022 Hacktoberfest Contributions</em></p>
<h2 id="joão-santiago-and-team-presenting-on-their-use-of-dvc-at-the-nlp-in-closure-session-2-event" style="position:relative;">João Santiago and team presenting on their use of DVC at the NLP in Clojure Session 2 event<a href="#jo%C3%A3o-santiago-and-team-presenting-on-their-use-of-dvc-at-the-nlp-in-closure-session-2-event" aria-label="joão santiago and team presenting on their use of dvc at the nlp in closure session 2 event permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>One of our Community Champions,
<a href="https://www.linkedin.com/in/jcpsantiago/" target="_blank" rel="nofollow noopener noreferrer"><strong>João Santiago</strong></a> of
<a href="https://www.billie.io/" target="_blank" rel="nofollow noopener noreferrer">Billie.io</a>, gives an introduction to DVC in preparation
for the remainder of the session, where
<a href="https://scicloj.github.io/blog/predict-real-vs.-fake-disaster-tweets/" target="_blank" rel="nofollow noopener noreferrer"><strong>Carsten Behring</strong></a>,
author of <a href="https://cljdoc.org/d/scicloj/metamorph/0.2.1/doc/readme" target="_blank" rel="nofollow noopener noreferrer">Metamorph</a>
and the <a href="https://github.com/scicloj/scicloj.ml" target="_blank" rel="nofollow noopener noreferrer">scicloj.ml</a> platform, presents
how NLP pipelines can be managed with DVC, Clojure &amp; Python.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/eubg-fjRh9E?rel=0&%3B=&%3Bshowinfo=0%3B&start=914" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="cml-at-neurips" style="position:relative;">CML at NeurIPS<a href="#cml-at-neurips" aria-label="cml at neurips permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Last month we reported on CML turning up in research
<a href="https://iterative.ai/blog/october-heartbeat#cml" target="_blank" rel="nofollow noopener noreferrer">here</a>. Well, this work will be
presented in the virtual workshop
<a href="https://neurips.cc/media/PosterPDFs/NeurIPS%202022/62157.png" target="_blank" rel="nofollow noopener noreferrer">Challenges In Deploying and Monitoring Machine Learning Systems</a>
at NeurIPS this year on December 9th.
<a href="https://neurips.cc/" target="_blank" rel="nofollow noopener noreferrer">Find out more and register here.</a></p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7bd565f0a5e5e75c1083e91b224cc9b8/39600/cml-neurips.png" alt="CML at NeurIPS" title="CML at NeurIPS" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Research
on CML to be presented at NeurIPS
(<a href="https://neurips.cc/media/PosterPDFs/NeurIPS%202022/62157.png" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h1 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="new-unstructured-data-catalog" style="position:relative;">New Unstructured Data Catalog<a href="#new-unstructured-data-catalog" aria-label="new unstructured data catalog permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Do you use Amazon S3, Azure Blob Storage, or Google Cloud Storage? We have a new
solution for finding and managing your datasets of unstructured data like
images, audio files, and PDFs! Extend your DVC environment with the first data
catalog and query language (SQL->DQL) for unstructured data and machine
learning. Learn more on <a href="https://iterative.ai/data-catalog-for-ml" target="_blank" rel="nofollow noopener noreferrer">our website</a>
and/or <a href="https://calendly.com/gtm-2/iterative-datamgmt-overview" target="_blank" rel="nofollow noopener noreferrer">schedule a meeting with us</a>!</p>
<h2 id="mlem" style="position:relative;">MLEM<a href="#mlem" aria-label="mlem permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 250px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c737dd4c5b3890a090185b9f3ed858b6/39600/dog-on-a-broomstick.png" alt="MLEM Sagemaker and Kubernetes
deployment" title="MLEM adds
Kubernetes and Sagemaker Deployment" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
In case you missed it, MLEM announced a release on Halloween! MLEM now supports
<a href="https://mlem.ai/doc/user-guide/deploying/sagemaker" target="_blank" rel="nofollow noopener noreferrer">Sagemaker</a> and
<a href="https://mlem.ai/doc/user-guide/deploying/kubernetes" target="_blank" rel="nofollow noopener noreferrer">Kubernetes</a> in addition to
<a href="https://mlem.ai/doc/user-guide/deploying/heroku" target="_blank" rel="nofollow noopener noreferrer">Heroku</a> and
<a href="https://mlem.ai/doc/user-guide/deploying/docker" target="_blank" rel="nofollow noopener noreferrer">Docker</a>. You can learn about
how easy it now is to package your models for deployment with only a few lines
of code and never have to get lost in Kubernetes docs again! Find the
<a href="https://iterative.ai/blog/mlem-k8s-sagemaker" target="_blank" rel="nofollow noopener noreferrer">blog post here</a> and be sure to
<a href="https://mlem.ai/doc/user-guide/deploying" target="_blank" rel="nofollow noopener noreferrer">visit the docs</a>!</p>
<h2 id="soc-2-type-1-compliance" style="position:relative;">SOC 2 Type 1 Compliance<a href="#soc-2-type-1-compliance" aria-label="soc 2 type 1 compliance permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><span class="gatsby-resp-image-wrapper image-wrap-right" style="position: relative; display: block; ; ; max-width: 250px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/844739961f1b26d85af1f3657ed1f21e/39600/soc-2-cover.png" alt="Iterative Achieves SOC 2 Type 1
Compliance" title="Iterative Achieves SOC 2
Type 1 Compliance" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
We are very excited to announce that Iterative is now SOC 2 Type 1 compliant.
This certification signals to our customers our commitment to Security,
Availability, Processing Integrity, Confidentiality, and Privacy within our
organization. We have successfully completed the rigorous process and have learned
much as a team along the way.
<a href="https://www.linkedin.com/in/gurobokum/" target="_blank" rel="nofollow noopener noreferrer"><strong>Guro Bokum</strong></a> reviews the five key
learnings <a href="https://iterative.ai/blog/SOC-2" target="_blank" rel="nofollow noopener noreferrer">in this blog piece</a>. You can find
the full report on our
<a href="https://iterative.ai/security-and-privacy" target="_blank" rel="nofollow noopener noreferrer">Security and Privacy</a> page.</p>
<h2 id="dmitry-petrov-at-github-universe" style="position:relative;">Dmitry Petrov at GitHub Universe<a href="#dmitry-petrov-at-github-universe" aria-label="dmitry petrov at github universe permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>On November 8th, our CEO, <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a>
spoke at <a href="https://githubuniverse.com/" target="_blank" rel="nofollow noopener noreferrer">GitHub Universe</a> on <em>ML with Git:
experiment tracking in Codespaces.</em> In his presentation, he shows how to use the
DVC extension for VS Code and Codespaces to streamline your machine learning
experimentation process. You can find his video below on the event platform if
you are registered. We expect the video to be available on YouTube in the next
couple of months. We'll keep you updated!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e0832e89812503e54d5cd6f6b33c73ab/03346/gh-universe.jpg" alt="Dmitry Petrov at GitHub Universe" title="Dmitry Petrov at GitHub Universe" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Dmitry Petrov during his talk, ML with Git: experiment tracking in Codespaces</em></p>
<h2 id="rob-de-wit---from-jupyter-notebook-to-dvc-pipeline-for-reproducible-ml-experiments" style="position:relative;">Rob de Wit - From Jupyter Notebook to DVC pipeline for reproducible ML experiments<a href="#rob-de-wit---from-jupyter-notebook-to-dvc-pipeline-for-reproducible-ml-experiments" aria-label="rob de wit from jupyter notebook to dvc pipeline for reproducible ml experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Jupyter Notebooks are great for prototyping, but eventually, you will want to
move toward reproducible experiments. Converting a notebook to a DVC pipeline
requires a bit of a mental shift.
<a href="https://www.linkedin.com/in/rcdewit/" target="_blank" rel="nofollow noopener noreferrer"><strong>Rob de Wit</strong></a> shows you how to
accomplish it with an intermediate step: use
<a href="https://papermill.readthedocs.io/en/latest/" target="_blank" rel="nofollow noopener noreferrer">Papermill</a> to build a one-stage
DVC pipeline that executes your entire notebook, and use the resulting pipeline
to run and version ML experiments. Look out for a future post with a more
advanced pipeline!</p>
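<p>As a rough illustration of that intermediate step, the sketch below builds the text of a one-stage <code>dvc.yaml</code> whose command runs a notebook through Papermill's command line. The helper name <code>papermill_stage_yaml</code>, the stage name, the file paths, and the <code>epochs</code> parameter are all hypothetical and are not taken from Rob's post; treating the executed notebook as the stage's only output is likewise an assumption.</p>

```python
import shlex


def papermill_stage_yaml(stage, notebook, executed, params=None):
    """Return dvc.yaml text for a single stage that runs a notebook via Papermill.

    All arguments are illustrative placeholders (assumptions), not names
    taken from the original blog post.
    """
    cmd = ["papermill", notebook, executed]
    # Papermill's CLI accepts notebook parameters as repeated `-p NAME VALUE`.
    for key, value in (params or {}).items():
        cmd += ["-p", key, str(value)]
    lines = [
        "stages:",
        f"  {stage}:",
        f"    cmd: {shlex.join(cmd)}",
        "    deps:",
        f"      - {notebook}",  # the source notebook is the stage's dependency
        "    outs:",
        f"      - {executed}",  # assumption: track the executed copy as an output
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        papermill_stage_yaml(
            "run_notebook",
            "notebook.ipynb",
            "outputs/notebook.ipynb",
            {"epochs": 5},
        )
    )
```

<p>Writing the returned text to <code>dvc.yaml</code> and running <code>dvc repro</code> (or <code>dvc exp run</code>) would then execute the whole notebook as a single stage. A real pipeline would usually also list data files and model artifacts under <code>deps</code> and <code>outs</code>.</p>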
<p><img src="https://media.giphy.com/media/wnWvARibI7pykx0mTf/giphy.gif" alt="Dvc GIF"></p>
<h2 id="meetups" style="position:relative;">Meetups<a href="#meetups" aria-label="meetups permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>At our next meetup on December 14th,
<a href="https://www.linkedin.com/in/sami-jawhar-a58b9849/" target="_blank" rel="nofollow noopener noreferrer"><strong>Sami Jawhar</strong></a> will
present <em>An Open Discussion of Parallel data pipelines with DVC and TPI</em>, an
advanced use case for distributing experiments in the cloud. Sami is a great
discussion driver. If you are interested in higher-level use cases you will want
to join the discussion!</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/events/289771497/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Sami Jawhar on Running Parallel Pipelines with DVC and TPI</h4>
<div class="elp-description">Join us on December 14th for an open discussion on Running Parallel Pipelines with DVC and
TPI!</div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2022-11-18/meetup-4329bdadd3d5940e6d7cd7bf26842a27.png" alt="Sami Jawhar on Running Parallel Pipelines with DVC and TPI">
</div>
</a>
</section>
<p></p>
<p>On January 11th,
<a href="https://www.linkedin.com/in/francescocalcavecchia/" target="_blank" rel="nofollow noopener noreferrer"><strong>Francesco Calcavecchia</strong></a>
will be joining us to talk about his recent contribution to MLEM through his
work on GTO and how this helps him in his work at
<a href="https://www.eon.de/de/pk.html" target="_blank" rel="nofollow noopener noreferrer">E.On Energie Deutschland</a> with creating a
Git-based model registry.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/events/289772002/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Francesco Calcavecchia on Designing a model Registry with Legacy Systems
using DVC and GTO</h4>
<div class="elp-description">Join us on January 11th. Designing a Model
Registry with Legacy Systems using GTO!</div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2022-11-18/meetup-4329bdadd3d5940e6d7cd7bf26842a27.png" alt="Francesco Calcavecchia on Designing a model Registry with Legacy Systems
using DVC and GTO">
</div>
</a>
</section>
<p></p>
<h2 id="events" style="position:relative;">Events<a href="#events" aria-label="events permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="odsc-west" style="position:relative;">ODSC West<a href="#odsc-west" aria-label="odsc west permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We had a great time at <a href="https://odsc.com/california/" target="_blank" rel="nofollow noopener noreferrer">ODSC West</a>! We had lively
conversations with conferencegoers and attended excellent sessions! Dmitry had a
packed room for his in-person talk <em>Why You Need a GitOps-based Machine Learning
Model Registry</em> and <a href="https://twitter.com/alex000kim" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim</strong></a> presented
<em>CI/CD for Machine Learning</em> virtually. At each of the conferences we've
sponsored this year, we've had a game called Deevee's Ramen Run. (If you don't
know the Ramen connection, you need to spend more time reading the monthly
Heartbeats 😉). Find the top three winners of the game below.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/6675d96b036ae7ca3df5ba3acd230859/39600/winners.png" alt="Winners of DeeVees Ramen Run" title="Winners of DeeVees Ramen Run" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Winners 1st - 3rd shown above: Alexandra Hagmeyer (pictured with me and
teammate Daniel Barnes), Ryan Renslow, and (name asked to be withheld, but she
was good with the picture and DeeVee!)</em></p>
<h3 id="mlops-summit-london" style="position:relative;">MLOps Summit London<a href="#mlops-summit-london" aria-label="mlops summit london permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We were also part of the
<a href="https://london-ml-ops.re-work.co/" target="_blank" rel="nofollow noopener noreferrer">MLOps Summit in London</a> only a week later!
Admittedly, there were different team members in attendance and staffing the
booth. Aside from attending a variety of great talks, we met many wonderful
people from all over the world. This resulted in some really interesting
discussions about how different companies approach MLOps.</p>
<p>Casper da Costa-Luis gave a well-received talk on how to painlessly run ML
experiments in the cloud with CML at the summit. The recording will be made
available in the near future, so look out for that! The talk answered at least
one of the questions of Deevee's Ramen Run, which yielded
<a href="https://www.linkedin.com/posts/rebecca-gorringe_machinelearning-iterative-reworkai-activity-6998338419772772353-FUip?utm_source=share&utm_medium=member_desktop" target="_blank" rel="nofollow noopener noreferrer">some surprised (but excited!) winners</a>
this time around.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/94acec2e1c21ff406e9be6340cda88ed/39600/team.png" alt="Iterative Team at MLOps Summit - London" title="Iterative Team at MLOps Summit - London" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Iterative Team members, clockwise from top right: Rob de Wit, Gema Parreño
Piqueras, Casper da Costa-Luis, and Chaz Black</em></p>
<h3 id="techweek" style="position:relative;">TechWeek<a href="#techweek" aria-label="techweek permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://twitter.com/SoyGema" target="_blank" rel="nofollow noopener noreferrer"><strong>Gema Parreño Piqueras</strong></a> presented at
<a href="https://www.ambito.com/negocios/tecnologia/comenzo-la-tech-week-latam-y-espana-mas-600-ofertas-empleo-it-n5578240" target="_blank" rel="nofollow noopener noreferrer">TechWeek</a>
in Spain with her talk <em>Reproducibility and Version Control are Important: Follow
up with the DVC extension for VS Code</em>. She will be presenting the same talk at
<a href="https://events.codemotion.com/conferences/online/2022/online-tech-conference-2022-spanish-edition-autumn" target="_blank" rel="nofollow noopener noreferrer">Codemotion</a>.
You can find her talk in Spanish at 2:02 below!</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/zXl9qINlbcI?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h3 id="upcoming-events" style="position:relative;">Upcoming events<a href="#upcoming-events" aria-label="upcoming events permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li>We will be participating in the
<a href="https://www.torontomachinelearning.com/" target="_blank" rel="nofollow noopener noreferrer">Toronto Machine Learning Summit</a>
on November 29-30 in Toronto.</li>
<li><a href="https://twitter.com/alex000kim" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim</strong></a> will present <em>CI/CD for Machine Learning</em>
for an ODSC Webinar.
<a href="https://app.aiplus.training/courses/CI-CD-for-Machine-Learning" target="_blank" rel="nofollow noopener noreferrer">Register here.</a></li>
<li>We will be at <a href="https://pydata.org/eindhoven2022/" target="_blank" rel="nofollow noopener noreferrer">PyData Eindhoven</a> on
December 2nd. Come say hi at the booth if you are attending! We have some
tickets to give away for the event in
<a href="https://discord.com/channels/485586884165107732/497187456051970048/1036999675951190056" target="_blank" rel="nofollow noopener noreferrer">Discord</a>.
First come, first served!</li>
<li>We are sponsoring <a href="https://normconf.com/" target="_blank" rel="nofollow noopener noreferrer">NormConf</a> on December 15th. They
will have Slack-based booths there. We are looking forward to supporting this
new conference!</li>
</ul>
<p>Stay tuned to
<a href="https://iterative.ai/#:~:text=Go%20to%20Twitter-,Subscribe,-for%20updates.%20We" target="_blank" rel="nofollow noopener noreferrer">our Newsletter</a>
for what we will be up to conference-wise in 2023!</p>
<h2 id="-doc-updates" style="position:relative;">✍🏼 Doc Updates!<a href="#-doc-updates" aria-label=" doc updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><img src="https://media.giphy.com/media/BemKqR9RDK4V2/giphy.gif" alt="Computer Working GIF"></p>
<p>The team has been busy improving the docs for you. See all the latest and
greatest updates below.</p>
<h3 id="dvc-docs" style="position:relative;">DVC Docs<a href="#dvc-docs" aria-label="dvc docs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li><a href="https://dvc.org/doc/api-reference/dvcfilesystem" target="_blank" rel="nofollow noopener noreferrer">DVCFileSystem</a> -
DVCFileSystem provides a read-only, Pythonic file interface
(<a href="https://filesystem-spec.readthedocs.io/" target="_blank" rel="nofollow noopener noreferrer">fsspec-compatible</a>) for a DVC
repo. It gives a unified view of all the files and directories in your
repository, whether they are Git-tracked, DVC-tracked, or untracked (in the
case of a local repository). It can reuse the files in the DVC cache and
otherwise streams them
from <a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">supported remote storage</a>.</li>
<li>We’ve now added
<a href="https://dvc.org/doc/command-reference/plots/show#example-horizontal-bar-plot" target="_blank" rel="nofollow noopener noreferrer">Horizontal bar plots</a>
to the mix of <a href="https://dvc.org/doc/command-reference/plots/show"><code>dvc plots show</code></a>!</li>
<li>You can now list contents from supported URLs with <a href="https://dvc.org/doc/command-reference/list-url"><code>dvc ls-url</code></a>. Find the
description, options, and example code
<a href="https://dvc.org/doc/command-reference/list-url" target="_blank" rel="nofollow noopener noreferrer">here.</a></li>
<li>Based on some feedback we reorganized the
<a href="https://dvc.org/doc/user-guide/overview" target="_blank" rel="nofollow noopener noreferrer">User Guide</a> to help you better
navigate. Let us know what you think!</li>
<li>Similarly, we reorganized the
<a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive documentation</a> for better navigation.</li>
</ul>
<h3 id="cml-docs" style="position:relative;">CML docs<a href="#cml-docs" aria-label="cml docs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li>In CML you can now publicly self-host images with <code>cml comment</code>. Find the
options <a href="https://cml.dev/doc/ref/comment#--publish" target="_blank" rel="nofollow noopener noreferrer">here.</a></li>
<li>Also, we’ve updated the
<a href="https://cml.dev/doc/self-hosted-runners" target="_blank" rel="nofollow noopener noreferrer">self-hosted runners</a> docs in CML.</li>
<li>We've now added a guide for bringing your data to GitLab using DVC. Find the
details <a href="https://cml.dev/doc/cml-with-dvc?tab=GitLab" target="_blank" rel="nofollow noopener noreferrer">in this doc.</a></li>
</ul>
<h3 id="mlem-docs" style="position:relative;">MLEM docs<a href="#mlem-docs" aria-label="mlem docs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li><a href="https://mlem.ai/doc" target="_blank" rel="nofollow noopener noreferrer">MLEM docs</a> have received a nearly full overhaul.</li>
<li>Additionally, the <a href="https://mlem.ai/doc/get-started" target="_blank" rel="nofollow noopener noreferrer">Get Started</a> section has
been greatly improved.</li>
<li>Look out for new docs to come out soon for
<a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a> on the <a href="https://mlem.ai/doc" target="_blank" rel="nofollow noopener noreferrer">MLEM</a>
website.</li>
</ul>
<h3 id="iterative-studio-docs" style="position:relative;">Iterative Studio docs<a href="#iterative-studio-docs" aria-label="iterative studio docs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li>You can now add a model from a remote location in Iterative Studio. Find
out more
<a href="https://dvc.org/doc/studio/user-guide/model-registry/add-a-model" target="_blank" rel="nofollow noopener noreferrer">here</a>.</li>
<li>Use the new Iterative Studio Wizard to set up CML in your CI. More on the
process and parameters
<a href="https://dvc.org/doc/studio/user-guide/projects-and-experiments/run-experiments#use-the-iterative-studio-wizard-to-set-up-your-ci-action" target="_blank" rel="nofollow noopener noreferrer">here in the docs.</a></li>
</ul>
<hr>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/mlem-k8s-sagemakerhttps://dvc.org/blog/mlem-k8s-sagemakerMon, 31 Oct 2022 00:00:00 GMT<p>To establish the deployment to cloud platforms, you have to learn how they work,
their secrets, and their quirks. To simplify your daily Swiss-army-knife ML
duties, you’ll need to write complicated bash scripts, figure out what arguments
need to be supplied to the platform CLI tool or API methods, call them in the
correct way, and embrace the burden of extending your knowledge to yet another
cloud platform tool.</p>
<p>But it doesn’t always have to be that way. Some tools like Terraform help you
with managing the infrastructure in a cloud-agnostic way, so why can’t we invent
the same for MLOps?</p>
<p>That’s why we’re releasing new Deployment mechanics for MLEM, along with 4
supported deployment targets:
<a href="https://mlem.ai/doc/user-guide/deploying/docker" target="_blank" rel="nofollow noopener noreferrer">Docker container deploy</a>,
<a href="https://mlem.ai/doc/user-guide/deploying/heroku" target="_blank" rel="nofollow noopener noreferrer">Heroku</a>,
<a href="https://mlem.ai/doc/user-guide/deploying/kubernetes" target="_blank" rel="nofollow noopener noreferrer">Kubernetes</a>, and
<a href="https://mlem.ai/doc/user-guide/deploying/sagemaker" target="_blank" rel="nofollow noopener noreferrer">AWS SageMaker</a>.</p>
<p><img src="https://media.giphy.com/media/bfOb3UnSzQvTsBKLmq/giphy.gif" alt="Docker, Heroku, Kubernetes, and SageMaker, in person"></p>
<h1 id="deploying-with-a-single-command" style="position:relative;">Deploying with a single command<a href="#deploying-with-a-single-command" aria-label="deploying with a single command permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>MLEM strives to abstract away everything you need to do for deployment. Once
you configure kubectl with your cluster IP and credentials, deploying your
model is as simple as:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token mlem">mlem deployment run</span> kubernetes app.mlem <span class="token punctuation">\</span>
<span class="token parameter variable">--model</span> model <span class="token parameter variable">--service_type</span> loadbalancer
</span>⏳️ Loading model from model.mlem
💾 Saving deployment to service_name.mlem
🛠 Creating docker image app
🛠 Building MLEM wheel file...
💼 Adding model files...
🛠 Generating dockerfile...
💼 Adding sources...
💼 Generating requirements file...
🛠 Building docker image app:4ee45dc33804b58ee2c7f2f6be447cda...
✅ Built docker image app:4ee45dc33804b58ee2c7f2f6be447cda
namespace created. status='{'conditions': None, 'phase': 'Active'}'
deployment created. status='{'available_replicas': None,
'collision_count': None,
'conditions': None,
'observed_generation': None,
'ready_replicas': None,
'replicas': None,
'unavailable_replicas': None,
'updated_replicas': None}'
service created. status='{'conditions': None, 'load_balancer': {'ingress': None}}'
✅ Deployment app is up in mlem namespace</code></pre></div>
<p>The <code>app.mlem</code> file stores all the information about the deployment that
we specified. Later, we can use it to deploy a new model version.</p>
<p>This created <code>deployment</code> and <code>service</code> resources in the cluster. Let’s check
out the pods that were created by the <code>deployment</code> (all the resources are
placed in the <code>mlem</code> namespace by default):</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">kubectl</span> get pods <span class="token parameter variable">--namespace</span> mlem
</span>NAMESPACE NAME READY STATUS RESTARTS AGE
mlem app-cddbcc89b-zkfhx 1/1 Running 0 5m58s</code></pre></div>
<h1 id="getting-predictions" style="position:relative;">Getting predictions<a href="#getting-predictions" aria-label="getting predictions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Since our model above is reachable over HTTP, we can just open the URL and
see the OpenAPI spec there (like
<a href="http://example-mlem-get-started-app.herokuapp.com/docs" target="_blank" rel="nofollow noopener noreferrer">this one</a>), or send
requests to get predictions. We can also use built-in MLEM functionality to
achieve the same:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token mlem">mlem deployment apply</span> app.mlem data.csv <span class="token parameter variable">--json</span>
</span>[0, 1, 2]</code></pre></div>
<h1 id="extend-your-learning" style="position:relative;">Extend your learning<a href="#extend-your-learning" aria-label="extend your learning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>That’s it: deployment to cloud providers is as simple as it can be. MLEM
simplifies your daily routine and helps you focus on developing models instead
of spending time in the DevOps weeds.</p>
<ul>
<li>To learn how MLEM can help you, try out the
<a href="https://mlem.ai/doc/get-started" target="_blank" rel="nofollow noopener noreferrer">Get Started Tutorial</a> or
<a href="https://mlem.ai/doc/use-cases" target="_blank" rel="nofollow noopener noreferrer">Use Cases</a>.</li>
<li>To see a full-scale tutorial for Kubernetes, SageMaker, or Heroku, check out
our <a href="https://mlem.ai/doc/user-guide" target="_blank" rel="nofollow noopener noreferrer">User Guide</a>.</li>
<li>To quickly get your questions answered, reach us in
<a href="https://discord.com/channels/485586884165107732/903647230655881226" target="_blank" rel="nofollow noopener noreferrer">Discord</a>
or <a href="https://github.com/iterative/mlem" target="_blank" rel="nofollow noopener noreferrer">GitHub issues</a>.</li>
</ul>
<h1 id="whats-next" style="position:relative;">What’s next?<a href="#whats-next" aria-label="whats next permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>It’s been five months since we released MLEM on the 1st of June, and now it’s
October 31st already. With all these big deployment targets, MLEM finally looks
like a formidable little dog 🎃. What’s next on the agenda?</p>
<ul>
<li>We’re going to work on an
<strong><a href="https://github.com/iterative/mlem/issues/454" target="_blank" rel="nofollow noopener noreferrer">e2e Computer Vision scenario</a></strong>.
Think about training a NN to classify images, saving it with MLEM, and
deploying it to K8s or SageMaker.</li>
<li>We are going to share how to use MLEM when your model
<a href="https://github.com/iterative/mlem/issues/283" target="_blank" rel="nofollow noopener noreferrer">consists of two parts: <strong>preprocessing and inference</strong></a>.</li>
<li>Batch processing is something we received many requests about. We’ll set up an
example of how to use
<a href="https://github.com/iterative/mlem/issues/11" target="_blank" rel="nofollow noopener noreferrer"><strong>MLEM with Airflow</strong></a> and
publish it. 📚</li>
</ul>
<p>Happy to hear your thoughts on this!</p>
<p>Machine Learning should be <del>mlemming</del> scary! Once a year only.</p>
<p><img src="https://media.giphy.com/media/dlYIz2AoqR5GcqZ1Yk/giphy.gif" alt="Happy Halloween!"></p>https://dvc.org/blog/jupyter-notebook-dvc-pipelinehttps://dvc.org/blog/jupyter-notebook-dvc-pipelineMon, 24 Oct 2022 00:00:00 GMT<p>While every data scientist has their own methods and approaches to conducting
data science, there is one tool that nearly everyone in the field uses:
<a href="https://jupyter.org/" target="_blank" rel="nofollow noopener noreferrer">Jupyter Notebook</a>. Its ease of use makes it the perfect
tool for prototyping, usually resulting in a script in which we preprocess the
data, do a train/test split, train our model, and evaluate it.</p>
<p>However, once we have a decent prototype, the subsequent iterations generally
don’t touch most of the code. Instead, we tend to focus on tweaking feature
engineering parameters and tuning model hyperparameters. At this point, we
really start experimenting, trying to answer questions such as <em>“What happens if
I increase the learning rate?”</em> and <em>“What’s the optimal batch size?”</em></p>
<p>It will take numerous experiments to get to an acceptable level of performance
for our model. But with so many experiments, it becomes difficult to keep track
of the changes. In turn, this makes it difficult to go back in time to a certain
point and see what combination of data, code, and parameters constituted a
specific experiment. In other words, we cannot <em>reproduce</em> our experiments.</p>
<admon type="info">
<p>Reproducibility is a core concept of our data science philosophy here at
Iterative. If you are new to the concept, I recommend reading
<a href="https://iterative.ai/blog/ml-experiment-versioning" target="_blank" rel="nofollow noopener noreferrer">this blog post by Dave Berenbaum</a>
or
<a href="https://neptune.ai/blog/how-to-solve-reproducibility-in-ml" target="_blank" rel="nofollow noopener noreferrer">this one by Ejiro Onoso</a>.</p>
</admon>
<p>We can solve our need for reproducibility by transforming our notebook into a
codified pipeline with defined inputs and outputs. This will allow us to then
save every experiment that modifies the inputs, pipeline, or outputs. In this
guide, we will explore how to do this using <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a>. It extends
Git so that in addition to code and parameters we can track and version data and
models.</p>
<h2 id="what-well-be-doing" style="position:relative;">What we’ll be doing<a href="#what-well-be-doing" aria-label="what well be doing permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>While a pipeline typically consists of multiple <em>stages</em>, transforming our
notebook straight into a multi-stage DVC pipeline may seem somewhat daunting.
For the sake of simplicity, we will create a pipeline with just one stage for
now: run all of the code in our notebook. Just like any other pipeline, we will
have defined inputs (data and parameters) and outputs (model, evaluation
metrics, and plots).</p>
<p>To achieve this, we will wrap our notebook with
<a href="https://papermill.readthedocs.io/en/latest/usage-workflow.html" target="_blank" rel="nofollow noopener noreferrer">Papermill</a>.
With this tool, we can parameterize our notebook and run experiments
<a href="https://papermill.readthedocs.io/en/latest/usage-execute.html#execute-via-cli" target="_blank" rel="nofollow noopener noreferrer">from our CLI with a single command</a>.</p>
<p>Throughout this guide, we will do the following:</p>
<ol>
<li>Parameterize a notebook using Papermill;</li>
<li>Create a single-stage pipeline with DVC;</li>
<li>Version our data, model, and other large artifacts using DVC; and</li>
<li>Run multiple experiments using the new pipeline.</li>
</ol>
<p>As an example project, we will be using a notebook I created that trains a
classifier for Pokémon sprites. You can find this project in
<a href="https://github.com/iterative/example-pokemon-classifier/tree/snapshot-jupyter" target="_blank" rel="nofollow noopener noreferrer">the repository here</a>.
Make sure to follow the instructions in <code>README.md</code> to set up the development
environment and to <code>git checkout snapshot-jupyter</code> to get our starting point for
this guide.</p>
<p>Of course, you can also follow along using a notebook you created yourself! In
that case, you will at least need to install <code>dvc</code> and <code>papermill</code>. You will
also need to initialize DVC through <a href="https://dvc.org/doc/command-reference/init"><code>dvc init</code></a>.</p>
<admon type="tip">
<p>If you're using <a href="https://code.visualstudio.com/" target="_blank" rel="nofollow noopener noreferrer">Visual Studio Code</a> as your
IDE, I also recommend
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">installing the DVC extension</a>.
This will make it even easier to run and compare experiments!</p>
</admon>
<h2 id="guide" style="position:relative;">Guide<a href="#guide" aria-label="guide permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Stages in a DVC pipeline consist of commands, just as we would run them in our
own terminal. As such, we need a way to run the contents of our notebook from
our command line. This is where Papermill comes in. With the following command,
we execute the entire notebook as a single unit without changing its contents:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ papermill <span class="token punctuation">\</span>
notebooks/pokemon_classifier.ipynb <span class="token punctuation">\</span>
outputs/completed_notebook.ipynb</code></pre></div>
<p>The result is saved as a new notebook in <code>outputs/completed_notebook.ipynb</code>.</p>
<h3 id="parameterize-notebook" style="position:relative;">Parameterize notebook<a href="#parameterize-notebook" aria-label="parameterize notebook permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>While we would technically have a DVC pipeline if we added this command as a
stage, its usefulness would be somewhat limited. After all, the result would be
the same every time we execute the command. To start experimenting with our
pipeline, we need to parameterize our notebook. We do so by creating a single
cell at the top of our notebook where we declare the parameters:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">SEED<span class="token punctuation">:</span> <span class="token builtin">int</span> <span class="token operator">=</span> <span class="token number">42</span>
POKEMON_TYPE_TRAIN<span class="token punctuation">:</span> <span class="token builtin">str</span> <span class="token operator">=</span> <span class="token string">"Water"</span>
SOURCE_DIRECTORY<span class="token punctuation">:</span> <span class="token builtin">str</span> <span class="token operator">=</span> <span class="token string">"data/external"</span>
DESTINATION_DIRECTORY<span class="token punctuation">:</span> <span class="token builtin">str</span> <span class="token operator">=</span> <span class="token string">"data/processed"</span>
TRAIN_DATA_IMAGES<span class="token punctuation">:</span> <span class="token builtin">str</span> <span class="token operator">=</span> <span class="token string">"images-gen-1-8"</span>
TRAIN_DATA_LABELS<span class="token punctuation">:</span> <span class="token builtin">str</span> <span class="token operator">=</span> <span class="token string">"stats/pokemon-gen-1-8.csv"</span>
MODEL_TEST_SIZE<span class="token punctuation">:</span> <span class="token builtin">float</span> <span class="token operator">=</span> <span class="token number">0.2</span>
MODEL_LEARNING_RATE<span class="token punctuation">:</span> <span class="token builtin">float</span> <span class="token operator">=</span> <span class="token number">0.001</span>
MODEL_EPOCHS<span class="token punctuation">:</span> <span class="token builtin">int</span> <span class="token operator">=</span> <span class="token number">10</span>
MODEL_BATCH_SIZE<span class="token punctuation">:</span> <span class="token builtin">int</span> <span class="token operator">=</span> <span class="token number">120</span></code></pre></div>
<p>Papermill
<a href="https://papermill.readthedocs.io/en/latest/usage-parameterize.html#designate-parameters-for-a-cell" target="_blank" rel="nofollow noopener noreferrer">needs a <code>parameters</code> tag</a>
to recognize this cell as the one containing our parameters. To add this tag to
the cell, we go to <code>View / Cell Toolbar</code> and enable <code>Tags</code>. Afterward, we type
in <code>parameters</code> in the top right corner of our cell.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 491px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/44511a7489df0e34ed81cc441214f754/5c810/jupyter-tags.png" alt="Enabling Tags for Jupyter Notebook
cells" title="Enabling Tags for Jupyter Notebook
cells" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Enabling
Tags for Jupyter Notebook cells</em></p>
<admon type="tip">
<p>In case you’re running the notebook straight from VS Code, please be aware that
<a href="https://github.com/microsoft/vscode-jupyter-powertoys/issues/48" target="_blank" rel="nofollow noopener noreferrer">editing cell tags is not natively supported here</a>.
You can use the
<a href="https://marketplace.visualstudio.com/items?itemName=ms-toolsai.vscode-jupyter-cell-tags" target="_blank" rel="nofollow noopener noreferrer">Jupyter Cell Tags extension</a>
or the editor in Jupyter Server as shown above.</p>
</admon>
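<p>If neither editor is at hand, the tag can also be added by editing the notebook
file directly, since an <code>.ipynb</code> file is JSON and cell tags are stored as a list
under each cell's <code>metadata</code>. The following is a minimal sketch using only the
Python standard library; the helper name <code>tag_parameters_cell</code> is hypothetical,
and it assumes the parameters cell is the first cell of the notebook:</p>

```python
import json
from pathlib import Path


def tag_parameters_cell(notebook_path, cell_index=0):
    """Add the `parameters` tag to one cell of a notebook file.

    Hypothetical helper: assumes an nbformat v4 notebook, in which each
    cell stores its tags as a list of strings under metadata["tags"].
    """
    path = Path(notebook_path)
    notebook = json.loads(path.read_text(encoding="utf-8"))

    cell = notebook["cells"][cell_index]
    tags = cell.setdefault("metadata", {}).setdefault("tags", [])
    if "parameters" not in tags:  # keep the call idempotent
        tags.append("parameters")

    path.write_text(json.dumps(notebook, indent=1) + "\n", encoding="utf-8")
    return tags
```

<p>Calling <code>tag_parameters_cell("notebooks/pokemon_classifier.ipynb")</code> would tag
the first cell in place. Close the notebook in any open editor first, so that
the editor does not overwrite the change when it saves.</p>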
<p>We can now replace hard-coded parameters in our notebook with references to the
variables we defined. For example, we change the following section of code like
so:</p>
<div class="gatsby-highlight" data-language="diff"><pre class="language-diff-python"><code class="language-diff-python">estimator = model.fit(X_train, y_train,
<span class="token unchanged language-python"><span class="token prefix unchanged"> </span> validation_data <span class="token operator">=</span> <span class="token punctuation">(</span>X_test<span class="token punctuation">,</span> y_test<span class="token punctuation">)</span><span class="token punctuation">,</span>
<span class="token prefix unchanged"> </span> class_weight <span class="token operator">=</span> calculate_class_weights<span class="token punctuation">(</span>y_train<span class="token punctuation">)</span><span class="token punctuation">,</span>
</span><span class="token deleted-sign deleted language-python"><span class="token prefix deleted">-</span> epochs <span class="token operator">=</span> <span class="token number">10</span><span class="token punctuation">,</span>
</span><span class="token inserted-sign inserted language-python"><span class="token prefix inserted">+</span> epochs <span class="token operator">=</span> MODEL_EPOCHS<span class="token punctuation">,</span>
</span><span class="token deleted-sign deleted language-python"><span class="token prefix deleted">-</span> batch_size <span class="token operator">=</span> <span class="token number">120</span><span class="token punctuation">,</span>
</span><span class="token inserted-sign inserted language-python"><span class="token prefix inserted">+</span> batch_size <span class="token operator">=</span> MODEL_BATCH_SIZE<span class="token punctuation">,</span>
</span><span class="token unchanged language-python"><span class="token prefix unchanged"> </span> verbose <span class="token operator">=</span> <span class="token number">1</span><span class="token punctuation">)</span></span></code></pre></div>
<p>Now we can run our notebook through Papermill with changed parameters:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ papermill <span class="token punctuation">\</span>
notebooks/pokemon_classifier.ipynb <span class="token punctuation">\</span>
outputs/completed_notebook.ipynb <span class="token punctuation">\</span>
<span class="token parameter variable">-p</span> MODEL_EPOCHS <span class="token number">15</span> <span class="token punctuation">\</span>
<span class="token parameter variable">-p</span> MODEL_BATCH_SIZE <span class="token number">160</span></code></pre></div>
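<p>When trying several parameter combinations, it can be handy to assemble this
command from a script rather than typing it out. Below is a minimal sketch that
uses only the standard library; the helper names <code>papermill_command</code> and
<code>run_notebook</code> are hypothetical, and it assumes the <code>papermill</code> executable is
on the <code>PATH</code>:</p>

```python
import subprocess


def papermill_command(input_path, output_path, parameters):
    """Build the argument list for a `papermill` call with -p overrides."""
    command = ["papermill", input_path, output_path]
    for name, value in parameters.items():
        # Each override becomes `-p NAME VALUE`, as in the command above.
        command += ["-p", name, str(value)]
    return command


def run_notebook(input_path, output_path, parameters):
    """Execute the notebook; raises CalledProcessError if Papermill fails."""
    subprocess.run(
        papermill_command(input_path, output_path, parameters), check=True
    )
```

<p>For example, <code>run_notebook("notebooks/pokemon_classifier.ipynb",
"outputs/completed_notebook.ipynb", {"MODEL_EPOCHS": 15, "MODEL_BATCH_SIZE": 160})</code>
runs the same command as shown above. Papermill also offers a Python function,
<code>papermill.execute_notebook</code>, which accepts a <code>parameters</code> dictionary if you
prefer to stay in Python.</p>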
<h3 id="create-dvc-pipeline" style="position:relative;">Create DVC pipeline<a href="#create-dvc-pipeline" aria-label="create dvc pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>With our parameterized notebook in place, we can create our pipeline with DVC.
Our pipeline consists of stages (in this case: one stage) and has inputs and
outputs. For our model, the inputs will be the required datasets and our
notebook. The pipeline’s outputs will be the model itself, a graph showing the
training process, and a confusion matrix for the model’s predictions.</p>
<p>Additionally, a pipeline can have metrics and plots. We will define several
metrics that allow us to compare model performance across different experiments,
such as accuracy and F1 scores.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f6c1b1df7a76c455086a0ebc527b7c66/39600/pipeline-components.png" alt="All of the pipeline
components" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Our
inputs, pipeline, and outputs</em></p>
<p>A DVC pipeline is defined in a dedicated <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. We can add stages
manually in this file, which you generally want to do when building complex,
multi-stage pipelines. However, to get started, it’s probably easier if we use
the <a href="https://dvc.org/doc/command-reference/stage/add"><code>dvc stage add</code></a> command. We use the <code>-n</code> option to provide a name for the
stage, the <code>-d</code> option to specify our dependencies, the <code>-o</code> option to specify
our outputs, and the <code>-M</code> option to specify our metrics file. Lastly, we type in
the command that DVC should execute for that stage:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc stage add</span> <span class="token punctuation">\</span>
<span class="token parameter variable">-n</span> run_notebook <span class="token punctuation">\</span>
<span class="token parameter variable">-d</span> notebooks/pokemon_classifier.ipynb <span class="token punctuation">\</span>
<span class="token parameter variable">-d</span> data/external/images-gen-1-8 <span class="token punctuation">\</span>
<span class="token parameter variable">-d</span> data/external/stats/pokemon-gen-1-8.csv <span class="token punctuation">\</span>
<span class="token parameter variable">-o</span> data/processed/pokemon <span class="token punctuation">\</span>
<span class="token parameter variable">-o</span> data/processed/pokemon.csv <span class="token punctuation">\</span>
<span class="token parameter variable">-o</span> data/processed/pokemon-with-image-paths.csv <span class="token punctuation">\</span>
<span class="token parameter variable">-o</span> outputs/model.pckl <span class="token punctuation">\</span>
<span class="token parameter variable">-o</span> outputs/confusion_matrix.png <span class="token punctuation">\</span>
<span class="token parameter variable">-o</span> outputs/train_history.png <span class="token punctuation">\</span>
<span class="token parameter variable">-M</span> outputs/metrics.yaml <span class="token punctuation">\</span>
papermill notebooks/pokemon_classifier.ipynb outputs/pokemon_classifier_out.ipynb</span></code></pre></div>
<p>The uppercase <code>-M</code> option (as opposed to the lowercase <code>-m</code> option) tells DVC
not to cache the resulting metrics file. We typically want to do this with
metrics because the files are small enough to be tracked by Git directly.</p>
<p>The resulting <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> looks as follows:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">run_notebook</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token punctuation">></span><span class="token scalar string">
      papermill notebooks/pokemon_classifier.ipynb
      outputs/pokemon_classifier_out.ipynb</span>
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> notebooks/pokemon_classifier.ipynb
      <span class="token punctuation">-</span> data/external/images<span class="token punctuation">-</span>gen<span class="token punctuation">-</span>1<span class="token punctuation">-</span><span class="token number">8</span>
      <span class="token punctuation">-</span> data/external/stats/pokemon<span class="token punctuation">-</span>gen<span class="token punctuation">-</span>1<span class="token punctuation">-</span>8.csv
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> data/processed/pokemon
      <span class="token punctuation">-</span> data/processed/pokemon.csv
      <span class="token punctuation">-</span> data/processed/pokemon<span class="token punctuation">-</span>with<span class="token punctuation">-</span>image<span class="token punctuation">-</span>paths.csv
      <span class="token punctuation">-</span> outputs/model.pckl
      <span class="token punctuation">-</span> outputs/confusion_matrix.png
      <span class="token punctuation">-</span> outputs/train_history.png
    <span class="token key atrule">metrics</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">outputs/metrics.yaml</span><span class="token punctuation">:</span>
          <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span></code></pre></div>
<p>With that, we have our pipeline in its basic form! We can run the pipeline with
the <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> command, and DVC will execute our notebook. We have yet to
specify our parameters, however. Until we do, every pipeline <em>run</em> will use
the default parameters we defined in our notebook.</p>
<p>DVC reads parameter values from another YAML file: <code>params.yaml</code>.
We can declare the same parameters here that we previously incorporated in our
notebook. To provide a little bit of structure, let’s also group them in
sections:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">base</span><span class="token punctuation">:</span>
  <span class="token key atrule">seed</span><span class="token punctuation">:</span> <span class="token number">42</span>
  <span class="token key atrule">pokemon_type_train</span><span class="token punctuation">:</span> <span class="token string">'Water'</span>
<span class="token key atrule">data_preprocess</span><span class="token punctuation">:</span>
  <span class="token key atrule">source_directory</span><span class="token punctuation">:</span> <span class="token string">'data/external'</span>
  <span class="token key atrule">destination_directory</span><span class="token punctuation">:</span> <span class="token string">'data/processed'</span>
  <span class="token key atrule">dataset_labels</span><span class="token punctuation">:</span> <span class="token string">'stats/pokemon-gen-1-8.csv'</span>
  <span class="token key atrule">dataset_images</span><span class="token punctuation">:</span> <span class="token string">'images-gen-1-8'</span>
<span class="token key atrule">train</span><span class="token punctuation">:</span>
  <span class="token key atrule">test_size</span><span class="token punctuation">:</span> <span class="token number">0.2</span>
  <span class="token key atrule">learning_rate</span><span class="token punctuation">:</span> <span class="token number">0.001</span>
  <span class="token key atrule">epochs</span><span class="token punctuation">:</span> <span class="token number">15</span>
  <span class="token key atrule">batch_size</span><span class="token punctuation">:</span> <span class="token number">120</span></code></pre></div>
<p>We can now update our pipeline in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> to read the parameters from
<code>params.yaml</code>. The file is detected automatically by DVC, and we can pass the
values into the <code>papermill</code> command with the <code>-p</code> option. The result will look
like this:</p>
<div class="gatsby-highlight" data-language="diff"><pre class="language-diff-yaml"><code class="language-diff-yaml">stages:
<span class="token unchanged language-yaml"><span class="token prefix unchanged"> </span>  <span class="token key atrule">run_notebook</span><span class="token punctuation">:</span>
<span class="token prefix unchanged"> </span>    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token punctuation">></span><span class="token scalar string">
<span class="token prefix unchanged"> </span>      papermill
<span class="token prefix unchanged"> </span>      notebooks/pokemon_classifier.ipynb
<span class="token prefix unchanged"> </span>      outputs/pokemon_classifier_out.ipynb</span>
</span><span class="token inserted-sign inserted language-yaml"><span class="token prefix inserted">+</span>      <span class="token punctuation">-</span>p SEED $<span class="token punctuation">{</span>base.seed<span class="token punctuation">}</span>
<span class="token prefix inserted">+</span>      <span class="token punctuation">-</span>p POKEMON_TYPE_TRAIN $<span class="token punctuation">{</span>base.pokemon_type_train<span class="token punctuation">}</span>
<span class="token prefix inserted">+</span>      <span class="token punctuation">-</span>p SOURCE_DIRECTORY $<span class="token punctuation">{</span>data_preprocess.source_directory<span class="token punctuation">}</span>
<span class="token prefix inserted">+</span>      <span class="token punctuation">-</span>p DESTINATION_DIRECTORY $<span class="token punctuation">{</span>data_preprocess.destination_directory<span class="token punctuation">}</span>
<span class="token prefix inserted">+</span>      <span class="token punctuation">-</span>p TRAIN_DATA_IMAGES $<span class="token punctuation">{</span>data_preprocess.dataset_images<span class="token punctuation">}</span>
<span class="token prefix inserted">+</span>      <span class="token punctuation">-</span>p TRAIN_DATA_LABELS $<span class="token punctuation">{</span>data_preprocess.dataset_labels<span class="token punctuation">}</span>
<span class="token prefix inserted">+</span>      <span class="token punctuation">-</span>p MODEL_TEST_SIZE $<span class="token punctuation">{</span>train.test_size<span class="token punctuation">}</span>
<span class="token prefix inserted">+</span>      <span class="token punctuation">-</span>p MODEL_LEARNING_RATE $<span class="token punctuation">{</span>train.learning_rate<span class="token punctuation">}</span>
<span class="token prefix inserted">+</span>      <span class="token punctuation">-</span>p MODEL_EPOCHS $<span class="token punctuation">{</span>train.epochs<span class="token punctuation">}</span>
<span class="token prefix inserted">+</span>      <span class="token punctuation">-</span>p MODEL_BATCH_SIZE $<span class="token punctuation">{</span>train.batch_size<span class="token punctuation">}</span>
</span><span class="token unchanged language-yaml"><span class="token prefix unchanged"> </span>    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
<span class="token prefix unchanged"> </span>      <span class="token punctuation">-</span> notebooks/pokemon_classifier.ipynb
<span class="token prefix unchanged"> </span>      <span class="token punctuation">-</span> data/external/images<span class="token punctuation">-</span>gen<span class="token punctuation">-</span>1<span class="token punctuation">-</span><span class="token number">8</span>
<span class="token prefix unchanged"> </span>      <span class="token punctuation">-</span> data/external/stats/pokemon<span class="token punctuation">-</span>gen<span class="token punctuation">-</span>1<span class="token punctuation">-</span>8.csv
<span class="token prefix unchanged"> </span>    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
<span class="token prefix unchanged"> </span>      <span class="token punctuation">-</span> data/processed/pokemon
<span class="token prefix unchanged"> </span>      <span class="token punctuation">-</span> data/processed/pokemon.csv
<span class="token prefix unchanged"> </span>      <span class="token punctuation">-</span> data/processed/pokemon<span class="token punctuation">-</span>with<span class="token punctuation">-</span>image<span class="token punctuation">-</span>paths.csv
<span class="token prefix unchanged"> </span>      <span class="token punctuation">-</span> outputs/model.pckl
<span class="token prefix unchanged"> </span>      <span class="token punctuation">-</span> outputs/confusion_matrix.png
<span class="token prefix unchanged"> </span>      <span class="token punctuation">-</span> outputs/train_history.png
<span class="token prefix unchanged"> </span>    <span class="token key atrule">metrics</span><span class="token punctuation">:</span>
<span class="token prefix unchanged"> </span>      <span class="token punctuation">-</span> <span class="token key atrule">outputs/metrics.yaml</span><span class="token punctuation">:</span>
<span class="token prefix unchanged"> </span>          <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span></span></code></pre></div>
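<p>To make the <code>${section.key}</code> placeholders less abstract, here is a minimal Python sketch of how a dotted key could be resolved against the nested structure of <code>params.yaml</code>. It is an illustration only, not DVC's implementation: the <code>PARAMS</code> dictionary, the <code>lookup</code> and <code>render</code> helpers, and the shortened notebook names are all assumptions made for this example.</p>

```python
# Illustrative only: mimics how a dotted key such as "train.epochs"
# maps onto the nested structure of params.yaml. This is not DVC's code.
import re

# Hypothetical in-memory copy of part of the params.yaml shown earlier.
PARAMS = {
    "base": {"seed": 42, "pokemon_type_train": "Water"},
    "train": {"test_size": 0.2, "learning_rate": 0.001, "epochs": 15, "batch_size": 120},
}


def lookup(params, dotted_key):
    """Walk the nested dict one key at a time."""
    value = params
    for part in dotted_key.split("."):
        value = value[part]
    return value


def render(template, params):
    """Replace every ${section.key} placeholder with its value."""
    return re.sub(
        r"\$\{([\w.]+)\}",
        lambda match: str(lookup(params, match.group(1))),
        template,
    )


# Shortened, hypothetical command used only to show the substitution.
command = render(
    "papermill in.ipynb out.ipynb -p SEED ${base.seed} -p MODEL_EPOCHS ${train.epochs}",
    PARAMS,
)
print(command)  # papermill in.ipynb out.ipynb -p SEED 42 -p MODEL_EPOCHS 15
```

<p>The point of the sketch is that the section names in <code>params.yaml</code> become the prefixes of the keys used in <code>dvc.yaml</code>, so regrouping the parameters means updating the placeholders too.</p>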
<p>And with that, we have our pipeline ready for use! Before we start running
experiments with it, however, let’s ensure everything is tracked and versioned
properly so we can reproduce our experiments later on.</p>
<h3 id="version-our-data-models-and-plots-with-dvc" style="position:relative;">Version our data, models, and plots with DVC<a href="#version-our-data-models-and-plots-with-dvc" aria-label="version our data models and plots with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>As we discussed earlier, we want to version every component of our experiments
to achieve true reproducibility: code, parameters, data, models, metrics, and
plots. We want to version small files (usually text) with Git and larger files
with DVC. That principle gives us the following split between the two:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2dbaf41ab7b1a61e962fd0a331d57002/39600/versioning-components.png" alt="Versioning all of the pipeline
components" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Every
component of our experiment is versioned either by Git or DVC</em></p>
<p>When we created our pipeline in the previous step, DVC automatically started
tracking the outputs we defined and listed them in our <code>.gitignore</code>. The
metrics file, on the other hand, is not cached by DVC and is still tracked by Git
(<code>cache: false</code>), because we added it with the uppercase <code>-M</code> option. If we
wanted to track the metrics with DVC as well, we could change this in our
<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>.</p>
<p>There is one last output of the pipeline we haven't yet accounted for:
<code>outputs/pokemon_classifier_out.ipynb</code>. Because it's a rather large file that we
don't really need for anything, we can simply add it to our <code>.gitignore</code>. After
all, we can always reproduce it by rerunning our pipeline!</p>
<p>With that, every component (of importance) in our project is now versioned by
Git or DVC. That means we now have the reproducible pipeline we set out to
create: we can go back to any experiment and get the precise combination of
code, data, parameters, and results. This will make it much easier to conduct
experiments, find the best-performing model, and collaborate with teammates.</p>
<p>Let’s take our pipeline for a ride and run some experiments!</p>
<admon type="info">
<p>At this point, we would typically also configure our DVC remote to make sure our
versioned data does not exist only on our local system. This is outside the scope of
this guide, but you can find guides for
<a href="https://iterative.ai/blog/using-gcp-remotes-in-dvc" target="_blank" rel="nofollow noopener noreferrer">Google Cloud Platform</a>,
<a href="https://iterative.ai/blog/azure-remotes-in-dvc" target="_blank" rel="nofollow noopener noreferrer">Azure Blob Storage</a>, and
<a href="https://iterative.ai/blog/aws-remotes-in-dvc" target="_blank" rel="nofollow noopener noreferrer">Amazon Web Services</a> on our blog.</p>
</admon>
<h3 id="running-experiments" style="position:relative;">Running experiments<a href="#running-experiments" aria-label="running experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There are two ways in which we can run experiments with our newly defined
pipeline. The first one utilizes our good ol’ command line interface. We can use
<a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> to run an experiment after we have changed the parameters in
<code>params.yaml</code>, or we could change the parameters in the command itself with the
<code>-S</code> option. The following command would trigger a new experiment with an
updated number of epochs, for example:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token string">'train.epochs=25'</span></span></code></pre></div>
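<p>Conceptually, an override such as <code>train.epochs=25</code> replaces a single value in the nested parameters for that experiment. The Python sketch below shows that idea under simplifying assumptions: <code>apply_override</code> is a hypothetical helper written for this example, it handles only numeric and string values, and it is not how DVC itself parses the <code>-S</code> option.</p>

```python
# Illustrative only: applies one "section.key=value" override to a nested dict.
import copy


def apply_override(params, override):
    """Return a copy of params with one 'section.key=value' override applied."""
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    updated = copy.deepcopy(params)  # leave the caller's parameters untouched
    node = updated
    for part in parents:
        node = node[part]
    # Reuse the type of the existing value so "25" becomes the int 25.
    node[leaf] = type(node[leaf])(raw_value)
    return updated


# Hypothetical subset of the params.yaml shown earlier.
params = {"train": {"epochs": 15, "learning_rate": 0.001}}
experiment = apply_override(params, "train.epochs=25")
print(experiment["train"]["epochs"])  # 25
print(params["train"]["epochs"])      # 15
```

<p>Working on a copy mirrors the idea that each experiment carries its own parameter values while the baseline stays available for comparison.</p>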
<p>However, if we’re using <a href="https://code.visualstudio.com/" target="_blank" rel="nofollow noopener noreferrer">Visual Studio Code</a> as
our IDE of choice, we can also use
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">the DVC extension</a>
to run and visualize experiments through a graphical user interface. We can go
to the experiment table and, from there, modify, queue, and run new experiments.
The results are listed one below the other, providing an easy way to compare
their outcomes.</p>
<p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-10-24/dvc-vscode-extension-3d9c6b635560d7ec3f20532230bca57d.mp4" type="video/mp4"> Your
browser does not support the video tag. </video></p>
<p>Now, all that’s left to do is to start experimenting and find the best possible
model! Once we have drawn our conclusions from experimenting, we can pick the
best-performing experiment and start using the model it put forth.</p>
<h2 id="conclusions" style="position:relative;">Conclusions<a href="#conclusions" aria-label="conclusions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Throughout this guide, we transformed a Jupyter Notebook into a codified
pipeline for reproducible experiments. We used Papermill to parameterize our
notebook so that we could run it with a single command and then created a
pipeline in DVC to run that command for us.</p>
<admon type="info">
<p>The result of following the guide can be found in
<a href="https://github.com/iterative/example-pokemon-classifier/tree/papermill-dvc" target="_blank" rel="nofollow noopener noreferrer">the <code>papermill-dvc</code> branch of the example project</a>.</p>
</admon>
<p>With our DVC pipeline tracking and versioning every experiment, we can discover
which combination of code, data, and parameters provides the best results.
Comparing experiments is especially easy when using the experiment table in the
DVC extension for Visual Studio Code.</p>
<p>From this point onwards, we can still make a few improvements to our pipeline.
For one, we could leverage DVC to generate our plots rather than render them as
images from our notebook. This would allow us to compare experiments visually in
a similar manner to how DVC can visualize an experiments table. To learn more
about this,
<a href="https://dvc.org/doc/command-reference/plots" target="_blank" rel="nofollow noopener noreferrer">please refer to the docs</a>.</p>
<p>Another improvement would be to break up our single-stage pipeline into
different stages with coherent units of code (e.g., preprocess, train, and
evaluate). Our current implementation runs the entire notebook for every single
experiment, even though the data preprocessing doesn’t change between
experiments. With a multi-stage pipeline, DVC could track changes to the inputs and
outputs of every stage and automatically determine which stages it can skip
because nothing has changed. This saves time and resources, especially in
computationally heavy projects.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc dag</span>
</span>+-------------------+
| data/external.dvc |
+-------------------+
          *
          *
          *
 +-----------------+
 | data_preprocess |
 +-----------------+
          *
          *
          *
    +-----------+
    | data_load |
    +-----------+
          *
          *
          *
      +-------+
      | train |
      +-------+
          *
          *
          *
    +----------+
    | evaluate |
    +----------+</code></pre></div>
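<p>The skipping behavior described above can be sketched in a few lines of Python. This is a deliberately simplified model, assuming each stage is described only by the contents of its dependencies; <code>fingerprint</code> and <code>stages_to_run</code> are hypothetical helpers, and the sketch ignores the fact that a rerun stage can also invalidate the stages downstream of it.</p>

```python
# Illustrative only: decide which stages need to run by comparing
# a hash of their dependencies with the hash recorded after the last run.
import hashlib


def fingerprint(dependencies):
    """Hash the names and contents of a stage's dependencies."""
    digest = hashlib.md5()
    for name in sorted(dependencies):
        digest.update(name.encode())
        digest.update(dependencies[name])
    return digest.hexdigest()


def stages_to_run(stages, recorded):
    """Return the stages whose dependency fingerprint differs from the recorded one."""
    return [
        name for name, deps in stages.items()
        if recorded.get(name) != fingerprint(deps)
    ]


# Hypothetical stage inputs, with file contents represented as bytes.
stages = {
    "data_preprocess": {"data/external": b"raw images"},
    "train": {"params:train.epochs": b"15"},
}
recorded = {name: fingerprint(deps) for name, deps in stages.items()}

stages["train"]["params:train.epochs"] = b"25"  # only the training parameter changes
print(stages_to_run(stages, recorded))  # ['train']
```

<p>Because the preprocessing inputs are unchanged, only the training stage is selected, which is where the time savings in a multi-stage pipeline come from.</p>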
<p>If you want to learn how to transform a notebook into a multi-stage pipeline, I
recommend taking a look at our course:
<a href="https://learn.dvc.org/course/data-scientist-path" target="_blank" rel="nofollow noopener noreferrer">Iterative tools for Data Scientists and Analysts</a>.
It is completely free to follow, and module 3 covers this process in depth.</p>
<p>We might also write a future guide about this, so let us know if you would be
interested in seeing this content. Make sure to join
<a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">our Discord server</a> if you have any questions or want to
discuss this post further!</p>https://dvc.org/blog/october-heartbeathttps://dvc.org/blog/october-heartbeatThu, 20 Oct 2022 00:00:00 GMT<p>Welcome to October! As the days grow shorter or longer depending on your
hemisphere, we bring you the latest and greatest from the Iterative Community.</p>
<h1 id="in-ai-news" style="position:relative;">In AI News<a href="#in-ai-news" aria-label="in ai news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="andrew-ng-at-intels-innovation-conference---democratizing-ai-through-data-centric-ai" style="position:relative;">Andrew Ng at Intel's Innovation Conference - Democratizing AI through Data-Centric AI<a href="#andrew-ng-at-intels-innovation-conference---democratizing-ai-through-data-centric-ai" aria-label="andrew ng at intels innovation conference democratizing ai through data centric ai permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/G3MaIMrR6Ms?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>At
<a href="https://www.intel.com/content/www/us/en/events/on-event-series/innovation.html" target="_blank" rel="nofollow noopener noreferrer">Intel’s Innovation</a>
conference, <a href="https://www.linkedin.com/in/andrewyng/" target="_blank" rel="nofollow noopener noreferrer"><strong>Andrew Ng</strong></a> gave a
keynote on democratizing AI. He posits that while large companies have embraced
AI, most smaller companies outside of the consumer-based domains still struggle.
He provides two main reasons for this: small datasets and customization.</p>
<p>According to Ng, data-centric AI will be the key to unlocking that potential,
forcing a paradigm shift away from code-centric AI. In this scenario, people
could take mostly ready-built ML tech and focus on the data to ensure it
captures all necessary domain knowledge.</p>
<p>For example, two companies that produce cornflakes and medication could take the
same ML model and train it on their respective datasets. As long as they have
the right tools and practices and provide a domain-representative dataset, the
same model can reproduce effective results. If you want to see some of the tools
Ng uses, make sure to check out his keynote.</p>
<p>What do you think? Does the average data scientist need a different set of
skills in the near future? Are you in one of these smaller industries that are
starting to embrace AI? We'd love to read your thoughts! Join us in our
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">discussion of this topic on Discord</a>!</p>
<h2 id="blueprint-for-an-ai-bill-of-rights" style="position:relative;">Blueprint for an AI Bill of Rights<a href="#blueprint-for-an-ai-bill-of-rights" aria-label="blueprint for an ai bill of rights permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/826ad49e017a3af5984d9c6cf494e987/39600/blue-print.png" alt="Blueprint for an AI Bill of Rights" title="White House Blueprint for an AI Bill of Rights" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
As you may recall, in
<a href="https://iterative.ai/blog/september-22-heartbeat#european-ai-act" target="_blank" rel="nofollow noopener noreferrer">last month's Heartbeat</a>
we called your attention to the EU AI Act. This act proposes new rules that
would require open source developers to adhere to guidelines across a spectrum
of categories including risk management, data governance, technical
documentation and transparency, standards and accuracy, and cybersecurity. Not
to be outdone, the US White House released a
<a href="https://www.whitehouse.gov/ostp/ai-bill-of-rights/" target="_blank" rel="nofollow noopener noreferrer">Blueprint for an AI Bill of Rights</a>.
<a href="https://www.whitehouse.gov/ostp/" target="_blank" rel="nofollow noopener noreferrer">The White House Office of Science and Technology Policy (OSTP)</a>
has defined five categories for these rights:</p>
<ol>
<li>Safe and Effective Systems</li>
<li>Algorithmic Discrimination Protections</li>
<li>Data Privacy</li>
<li>Notice and Explanation</li>
<li>Human Alternatives, Consideration, and Fallback</li>
</ol>
<p>There's definitely some overlap here with the EU AI Act and some catching up
with Data Privacy in the mix. There's lots to unpack, compare, and contrast on
scope and philosophy between the two. It's nice to see that major attention is
given to these issues.</p>
<p>We could think of the relationship between AI rights and Andrew Ng's talk in
terms of the AI space maturing. To Andrew Ng's point, as we move from frenzied,
model-centric development toward a data-centric approach and broader
democratization, we are shifting our focus in a way that lets us adequately
address these hard and important issues. Improving the
efficiency of tooling will help with this too. That's why we are here.</p>
<p>What do you think? Do the efficiencies we are gaining open up room for improved
time/attention to bake protections into the process or am I too hopeful? Head to
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a> and share your thoughts!</p>
<h1 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e798beca6e65dd684e680b7d07318b57/03346/hydra.jpg" alt="DVC-Hydra integration" title="DVC-Hydra integration" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>AI
generated image of rainbow feathered dragon (DeeVee + Hydra)</em></p>
<h2 id="dvc-hydra-integration" style="position:relative;">DVC-Hydra Integration<a href="#dvc-hydra-integration" aria-label="dvc hydra integration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Did you hear? DVC has a new integration with Hydra. Now you can use Hydra
composition to configure your DVC experiments. You can also append and remove
parameters on the fly as well as do a grid search of parameters. Random search
functionality is coming;
<a href="https://github.com/iterative/dvc/issues/8258" target="_blank" rel="nofollow noopener noreferrer">weigh in on the issue here.</a> Find
out more in <a href="https://twitter.com/daviddelachurch" target="_blank" rel="nofollow noopener noreferrer"><strong>David de la Iglesia's</strong></a>
<a href="https://iterative.ai/blog/dvc-hydra-integration" target="_blank" rel="nofollow noopener noreferrer">blog post</a>.</p>
<h2 id="october-meetup" style="position:relative;">October Meetup<a href="#october-meetup" aria-label="october meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>If you missed the October Meetup with
<a href="https://www.linkedin.com/in/nadia-nahar-iit/" target="_blank" rel="nofollow noopener noreferrer"><strong>Nadia Nahar</strong></a> presenting her
team's research on <em>Collaboration Challenges in Building ML-Enabled Systems:
Communication, Documentation, Engineering, and Process</em>, don't worry: there's a
video! Catch it below!</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/FKdVSNfnD_M?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="november-meetup" style="position:relative;">November Meetup<a href="#november-meetup" aria-label="november meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Join us for our next meetup on November 16th. We will have
<a href="https://www.linkedin.com/in/dim25/" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmytro Filatov</strong></a> of
<a href="https://deepxhub.com/" target="_blank" rel="nofollow noopener noreferrer">DeepX</a> presenting <em>Continuous Computer Vision with DVC
and CML</em> and <a href="https://www.linkedin.com/in/jelle-bouwman/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jelle Bouwman</strong></a>
demoing Iterative Studio Model Registry. Be sure to register
<a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/events/289088542/" target="_blank" rel="nofollow noopener noreferrer">here!</a></p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/events/289088542/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Continuous Computer Vision with DVC and CML plus Iterative Studio Model Registry Demo</h4>
<div class="elp-description">Join us on November 16th. Come see the possibilities with DVC, CML, and Iterative Studio Model Registry!</div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2022-10-20/meetup-6b29c88388fd183f67d88dad40d5c671.png" alt="Continuous Computer Vision with DVC and CML plus Iterative Studio Model Registry Demo">
</div>
</a>
</section>
<p></p>
<h2 id="alex-kim---cicd-for-machine-learning-webinar-with-odsc" style="position:relative;">Alex Kim - CI/CD for Machine Learning Webinar with ODSC<a href="#alex-kim---cicd-for-machine-learning-webinar-with-odsc" aria-label="alex kim cicd for machine learning webinar with odsc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Join <a href="https://twitter.com/alex000kim" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim</strong></a> on November 30th with
<a href="https://opendatascience.com/" target="_blank" rel="nofollow noopener noreferrer">ODSC</a> to learn about CI/CD for Machine Learning.
This webinar introduces CML, a project that helps ML and data science
practitioners automate their ML model training and model evaluation, using best
practices and tools from software engineering, such as GitLab CI/CD (as well as
GitHub Actions and Bitbucket Pipelines). The idea is to automatically train your
model and test it in a production-like environment every time your data or code
changes. In this talk, you'll learn how to:</p>
<ul>
<li>Automatically allocate cloud instances (AWS, Azure, GCP) to train ML models,
and automatically shut the instances down when training is over</li>
<li>Automatically generate reports with graphs and tables in pull/merge requests
to summarize your model's performance, using any visualization library</li>
<li>Transfer data between cloud storage and computing instances with DVC</li>
<li>Customize your automation workflow with GitLab CI/CD</li>
</ul>
<p>Sign up for the talk
<a href="https://register.gotowebinar.com/register/6817359546805649932?utm_campaign=Webinars&utm_source=Community&utm_medium=Community&utm_content=Webinar%2030th%20Nov%202022" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e98308d2e1d9eae00586b4b24266e708/39600/alex-kim.png" alt="Alex Kim ODSC webinar" title="Alex Kim ODSC webinar" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Alex
Kim webinar CI/CD for Machine Learning for ODSC
(<a href="https://register.gotowebinar.com/register/6817359546805649932?utm_campaign=Webinars&utm_source=Community&utm_medium=Community&utm_content=Webinar%2030th%20Nov%202022" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="its-hacktoberfest" style="position:relative;">It's Hacktoberfest!<a href="#its-hacktoberfest" aria-label="its hacktoberfest permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 200px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/47bd367f4f623fc9f46c4ba7fc706e51/39600/hacktoberfest.png" alt="Iterative Hacktoberfest" title="Iterative Hacktoberfest" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
It's Hacktoberfest month and we are participating! Find out all the information
in <a href="https://twitter.com/mertbozkirr" target="_blank" rel="nofollow noopener noreferrer"><strong>Mert Bozkir's</strong></a>
<a href="https://iterative.ai/blog/iterative-x-hacktoberfest-2022" target="_blank" rel="nofollow noopener noreferrer">blog post</a>. But if
you just want to jump in, find all the open Hacktoberfest issues
<a href="https://github.com/search?o=desc&q=org%3Aiterative+label%3Ahacktoberfest&s=comments&state=open&type=Issues" target="_blank" rel="nofollow noopener noreferrer">here.</a>
Follow along in the <code>#hacktoberfest</code> channel in Discord to keep up to date for
the rest of the month and be sure to read next month's Heartbeat to learn of the
contributions!</p>
<h2 id="new-hires" style="position:relative;">New Hires<a href="#new-hires" aria-label="new hires permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/ivan-longin/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ivan Longin</strong></a> joins us as a Senior
Software Engineer on the Iterative Studio team from Zadar, Croatia. When Ivan's
not working, he likes to spend time doing outdoor activities, swimming in good
weather, or just walking (or, often, running) after his one-year-old! Been there
three times over! ❤️ Welcome Ivan!</p>
<h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>This month was full of great content. We wanted to give a shout-out to all of
it, so we are trying out a more abbreviated list.<br>
Thanks to all these amazing Community members who are sharing their knowledge!
🚀</p>
<h2 id="dvc" style="position:relative;">DVC<a href="#dvc" aria-label="dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="data-management" style="position:relative;">Data management<a href="#data-management" aria-label="data management permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li><a href="https://towardsdatascience.com/data-and-machine-learning-model-versioning-with-dvc-34fdadd06b15" target="_blank" rel="nofollow noopener noreferrer">Data and Machine Learning Model Versioning with DVC</a>
by <a href="https://www.linkedin.com/in/marcellusrubenwinastwan/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ruben Winastwan</strong></a>.
Nice visuals! ⭐️</li>
<li>A great guide from <a href="https://www.linkedin.com/in/wmeints/" target="_blank" rel="nofollow noopener noreferrer"><strong>Willem Meints</strong></a> -
<a href="https://fizzylogic.nl/2022/10/14/managing-machine-learning-datasets-with-dvc" target="_blank" rel="nofollow noopener noreferrer">Managing Machine Learning Datasets with DVC.</a>
Also, find his
<a href="https://twitter.com/willem_meints/status/1580898467097980932?s=20&t=SD8k9hZ7ygzEFlGBNTyJSA" target="_blank" rel="nofollow noopener noreferrer">Tweets on Twitter</a></li>
<li><a href="https://www.linkedin.com/in/jorgehabibnamour/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jorge Namour</strong></a> will give a
webinar on
<a href="https://www.facebook.com/facet.unt/posts/pfbid03ABqt5v1tUhRJJowSZgvjaYdFYfyirxGu9aph6LstYu8rVPJsYeuTBPio9srMn4hl" target="_blank" rel="nofollow noopener noreferrer">Tracking Data with Git + DVC</a>
in Spanish on October 27th
<a href="https://www.youtube.com/watch?v=pYLEf9FsFic" target="_blank" rel="nofollow noopener noreferrer">at this YouTube link.</a></li>
<li>Some GitHub goodness:
<a href="https://github.com/datarootsio/tutorial-mlops" target="_blank" rel="nofollow noopener noreferrer">MLOps - tutorial with DVC, MLFlow, and Pycaret</a>
from <a href="https://github.com/murilo-cunha" target="_blank" rel="nofollow noopener noreferrer"><strong>Murilo Cunha</strong></a>,
<a href="https://github.com/vspara" target="_blank" rel="nofollow noopener noreferrer"><strong>vspara</strong></a>, and
<a href="https://github.com/virginiemar" target="_blank" rel="nofollow noopener noreferrer"><strong>virginiemar</strong></a></li>
<li>Updated Udemy course that includes DVC -
<a href="https://www.udemy.com/course/complete-mlops-bootcamp-from-zero-to-hero-in-python-2022/?utm_source=aff-campaign&utm_medium=udemyads&LSNPUBID=McqLy3Lfq44&ranMID=47901&ranEAID=McqLy3Lfq44&ranSiteID=McqLy3Lfq44-MTrInsWY4oEt0kDxUzExAg" target="_blank" rel="nofollow noopener noreferrer">Complete MLOps Bootcamp | From Zero to Hero in Python 2022</a></li>
<li><a href="https://mathdatasimplified.com/2022/10/07/how-to-version-control-your-data-and-models-with-dvc/?utm_source=rss&utm_medium=rss&utm_campaign=how-to-version-control-your-data-and-models-with-dvc" target="_blank" rel="nofollow noopener noreferrer">How to Version Control Your Data and Models with DVC</a>
(<strong>Video included</strong>) by
<a href="https://www.linkedin.com/in/khuyen-tran-1401/" target="_blank" rel="nofollow noopener noreferrer"><strong>Khuyen Tran</strong></a>. Dig the DVC
color-themed command line! 🤩</li>
<li>NLP and CV with DVC!
<a href="https://pub.towardsai.net/from-unet-to-bert-extraction-of-important-information-from-scientific-papers-ef0f737e45e9" target="_blank" rel="nofollow noopener noreferrer">From UNet to BERT: Extraction of Important Information from Scientific Papers</a>
by <a href="https://www.linkedin.com/in/eman-shemsu-83473684/" target="_blank" rel="nofollow noopener noreferrer"><strong>Eman Shemsu</strong></a></li>
<li><a href="https://minimin2.tistory.com/m/185" target="_blank" rel="nofollow noopener noreferrer">[MLOps] How to use DVC (Data Version Control) data versioning</a>
in Korean 🇰🇷 by Minimin2</li>
</ul>
<h3 id="data-pipelines" style="position:relative;">Data Pipelines<a href="#data-pipelines" aria-label="data pipelines permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li>Great guide from
<a href="https://www.linkedin.com/in/deborahmesquita/" target="_blank" rel="nofollow noopener noreferrer"><strong>Déborah Mesquita</strong></a> -
<a href="https://towardsdatascience.com/the-ultimate-guide-to-building-maintainable-machine-learning-pipelines-using-dvc-a976907b2a1b" target="_blank" rel="nofollow noopener noreferrer">The ultimate guide to building maintainable Machine Learning pipelines using DVC</a>
(<strong>Video Included</strong>) ⭐️</li>
<li>Also from <a href="https://www.linkedin.com/in/khuyen-tran-1401/" target="_blank" rel="nofollow noopener noreferrer"><strong>Khuyen Tran</strong></a>:
<a href="https://towardsdatascience.com/create-a-maintainable-data-pipeline-with-prefect-and-dvc-1d691ea5bcea" target="_blank" rel="nofollow noopener noreferrer">Create a Maintainable Data Pipeline with Prefect and DVC</a></li>
</ul>
<h3 id="experimentation" style="position:relative;">Experimentation<a href="#experimentation" aria-label="experimentation permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li>In-depth tutorial covering Data Management, Pipelines and Experimentation with
DVC from <a href="https://www.linkedin.com/in/givashkevich/" target="_blank" rel="nofollow noopener noreferrer"><strong>Gleb Ivashkevich</strong></a> -
<a href="https://medium.com/y-data-stories/creating-reproducible-data-science-workflows-with-dvc-3bf058e9797b" target="_blank" rel="nofollow noopener noreferrer">Creating Reproducible data Science Workflows with DVC</a>
⭐️</li>
<li><a href="https://iblog.ridge-i.com/entry/2022/10/11/102033" target="_blank" rel="nofollow noopener noreferrer">Data Version Control (DVC): Beginner's Guide</a>
by <a href="https://www.linkedin.com/in/ajmain-inqiad-alam/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ajmain Inqiad Alam</strong></a></li>
</ul>
<h3 id="other-mentions" style="position:relative;">Other mentions<a href="#other-mentions" aria-label="other mentions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li>There is now a
<a href="https://en.wikipedia.org/w/index.php?title=Data_Version_Control&diff=1114227867&oldid=1114227707" target="_blank" rel="nofollow noopener noreferrer"><strong>DVC Wikipedia page!</strong></a></li>
<li>Great discussion around challenges in machine learning from
<a href="https://medium.com/@dvsamchuk" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmytro Samchuk</strong></a> -
<a href="https://medium.com/@dvsamchuk/machine-learning-done-right-in-your-business-130acd3a093e" target="_blank" rel="nofollow noopener noreferrer">Machine Learning Done Right in Your Business.</a></li>
</ul>
<h2 id="cml" style="position:relative;">CML<a href="#cml" aria-label="cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ul>
<li>CML in research! 🤩
<a href="https://arxiv.org/abs/2209.11453" target="_blank" rel="nofollow noopener noreferrer">A Preliminary Investigation of MLOps Practices in GitHub</a>,
<a href="https://arxiv.org/pdf/2209.11453.pdf" target="_blank" rel="nofollow noopener noreferrer">PDF</a> by
<a href="https://www.linkedin.com/in/fcalefato/" target="_blank" rel="nofollow noopener noreferrer"><strong>Fabio Calefato</strong></a>,
<a href="https://www.linkedin.com/in/lanubile/" target="_blank" rel="nofollow noopener noreferrer"><strong>Filippo Lanubile</strong></a>, and
<a href="https://www.linkedin.com/in/luigi-quaranta-007a6112a/" target="_blank" rel="nofollow noopener noreferrer"><strong>Luigi Quaranta</strong></a></li>
<li>Part III in <a href="https://twitter.com/m_a_upson" target="_blank" rel="nofollow noopener noreferrer"><strong>Matt Upson's</strong></a> series:
<a href="https://medium.com/mantisnlp/mlops-for-conversational-ai-with-rasa-dvc-and-cml-part-iii-f56a29c428f3?source=rss----72ea48936cdc---4" target="_blank" rel="nofollow noopener noreferrer">MLOps for Conversational AI with Rasa, DVC, and CML (Part III)!</a></li>
<li><a href="https://mail-redir.mention.com/api/url?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1cmwiOiJodHRwczpcL1wvZ2l0aHViLmNvbVwvcjBmMVwvZGF0YXNjaWVuY2VcL2NvbW1pdFwvNzMzMTU0YTdjYWJlOGY2MDRlMmMwYzQwOWI2NzRhY2QyODg3NWJhMCIsImFjY291bnRfaWQiOjEwMDMyNDIsImFsZXJ0X2lkIjoyNDM1MTgwLCJzb3VyY2VfaWQiOjY3LCJtZW50aW9uX2lkIjoxNDAzNzIzOTkwMzV9.AQcSYPdGzKBJemSgTDlyPcSeWL7dJTIlULRJaDqDVRg" target="_blank" rel="nofollow noopener noreferrer">Zen ML adds CML to its Awesome Data Science with Python list.</a>
😎</li>
<li><a href="https://www.linkedin.com/in/alessandro-paticchio/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alessandro Paticchio</strong></a>
(Casavo)
<a href="https://medium.com/casavo/using-ai-to-automatically-estimate-the-status-of-a-fa%C3%A7ade-c84c2a90549e" target="_blank" rel="nofollow noopener noreferrer">Using AI to automatically estimate the status of a façade.</a>
⭐️</li>
<li><a href="https://cmtech.live/2022/08/31/ci-cd-for-machine-learning-model-training-with-github-actions-by-zoumana-keita-aug-2022/" target="_blank" rel="nofollow noopener noreferrer">CI/CD for Machine Learning Model Training with GitHub Actions</a>
by <a href="https://www.linkedin.com/in/zoumana-keita/" target="_blank" rel="nofollow noopener noreferrer"><strong>Zoumana Keita</strong></a></li>
</ul>
<h2 id="mlem" style="position:relative;">MLEM<a href="#mlem" aria-label="mlem permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ul>
<li><a href="https://www.instagram.com/tv/Cjnl8CuK2K0/" target="_blank" rel="nofollow noopener noreferrer">MLEM Instagram</a>. If you're on IG,
follow <a href="https://www.instagram.com/the_ai_dot/" target="_blank" rel="nofollow noopener noreferrer">the_ai_dot</a> for AI & ML News,
Tools & Libraries</li>
</ul>
<h2 id="️-tweet-love" style="position:relative;">❤️ Tweet Love<a href="#%EF%B8%8F-tweet-love" aria-label="️ tweet love permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>I had a really hard time choosing this month, but I was excited to see this
Tweet from <a href="https://twitter.com/nsorros" target="_blank" rel="nofollow noopener noreferrer"><strong>Nick Sorros</strong></a> announcing the post
from his colleague <a href="https://twitter.com/m_a_upson" target="_blank" rel="nofollow noopener noreferrer">Matt Upson</a>.</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">A little belated but neverthless hugely interesting post by my co founders <a href="https://twitter.com/m_a_upson">@m_a_upson</a> in which he touches on some core tools we use at Mantis like <a href="https://twitter.com/DVCorg">@DVCorg</a>, <a href="https://twitter.com/Rasa_HQ">@Rasa_HQ</a> and continuous machine learning.<br><br>It comes with code 💻 so you can take some of what you will read and use 🚀 <a href="https://t.co/PHgLXtvckz">https://t.co/PHgLXtvckz</a></p>— Nick Sorros (@nsorros) <a href="https://twitter.com/nsorros/status/1571844138575843331">September 19, 2022</a></blockquote>
<hr>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/iterative-x-hacktoberfest-2022https://dvc.org/blog/iterative-x-hacktoberfest-2022Tue, 11 Oct 2022 00:00:00 GMT<p>Hacktoberfest is DigitalOcean’s annual event that encourages people to
contribute to open source throughout October. Hacktoberfest is all about giving
back to the community by contributing to open-source projects. The main point of
Hacktoberfest is to encourage new open-source contributors, but whether you’re a
seasoned contributor or looking for projects to contribute to for the first
time, you’re welcome to participate!</p>
<h2 id="what-is-iterative" style="position:relative;">What is Iterative<a href="#what-is-iterative" aria-label="what is iterative permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Iterative is a remote-first team on a mission to solve the complexity of
managing datasets, ML infrastructure, and ML model lifecycle management. It was
started in 2018 by a data scientist and an engineer to fill the gaps on the way
from machine learning to production. Iterative is now growing fast, adoption of
the Iterative tools has increased significantly, and we have our contributors to
thank (more than 300 across code and docs) for developing open source projects
such as DVC, CML, and MLEM with us.</p>
<p align="center">
<img src="https://media.giphy.com/media/wIVA0zh5pt0G5YtcAL/giphy.gif" alt="animated">
</p>
<h2 id="quick-start" style="position:relative;">Quick Start<a href="#quick-start" aria-label="quick start permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ul>
<li>Sign up for Hacktoberfest <a href="https://hacktoberfest.com/auth/" target="_blank" rel="nofollow noopener noreferrer">here</a></li>
<li>Find all the Hacktoberfest issues
<a href="https://github.com/search?o=desc&q=org%3Aiterative+label%3Ahacktoberfest&s=comments&state=open&type=Issues" target="_blank" rel="nofollow noopener noreferrer">here</a></li>
<li>Read the contribution guidelines (<a href="https://dvc.org/doc/contributing/core" target="_blank" rel="nofollow noopener noreferrer">DVC</a>,
<a href="https://cml.dev/doc/contributing/core" target="_blank" rel="nofollow noopener noreferrer">CML</a>,
<a href="https://mlem.ai/doc/contributing/core" target="_blank" rel="nofollow noopener noreferrer">MLEM</a>)</li>
<li>Join our <a href="https://discord.gg/5j3uvSnzXb" target="_blank" rel="nofollow noopener noreferrer">Hacktoberfest Discord channel</a> and
ask any questions</li>
<li>Create a pull request on the related GitHub repository</li>
</ul>
<h2 id="how-to-participate" style="position:relative;">How to Participate<a href="#how-to-participate" aria-label="how to participate permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The most exciting part about being involved in the open-source community is that
no matter how small or big your contributions are, the community will welcome
your efforts and collaborate with you positively, sharing feedback and
expressing gratitude.</p>
<p>If you haven’t started your Hacktoberfest challenge yet, now is just the right
time; you have 4 weeks left to submit PRs and get your swag! Here are some
important details:</p>
<ul>
<li>Hacktoberfest is open to everyone in the global community</li>
<li>You can sign up anytime between October 1 and October 31. Make sure to sign up
on the <a href="https://hacktoberfest.com/" target="_blank" rel="nofollow noopener noreferrer">official Hacktoberfest website</a> for your
PRs to count</li>
<li>Pull requests can be made in
any <a href="https://github.com/topics/hacktoberfest" target="_blank" rel="nofollow noopener noreferrer">GitHub</a> project that’s
participating in Hacktoberfest (look for the “Hacktoberfest” topic)</li>
<li>Project maintainers must accept your pull/merge requests for them to count
toward your total</li>
<li>Have 4 pull/merge requests accepted between October 1 and October 31 to
complete Hacktoberfest</li>
</ul>
<p>And the special addition from the Iterative team:</p>
<ul>
<li>Look through the list
of <a href="https://github.com/search?o=desc&q=org%3Aiterative+label%3Ahacktoberfest&s=comments&state=open&type=Issues" target="_blank" rel="nofollow noopener noreferrer">Iterative Hacktoberfest tickets</a>.</li>
<li>Make a PR to one of our repositories and get our stickers.</li>
<li>Close two issues for Iterative and get a special edition T-shirt.</li>
</ul>
<h3 id="important-rules" style="position:relative;">Important Rules<a href="#important-rules" aria-label="important rules permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li>Your pull/merge requests must be within the bounds of Hacktoberfest</li>
<li>Your pull/merge requests must not be spammy</li>
<li>Your pull/merge requests must be in a repo tagged with the “Hacktoberfest”
topic, or be labeled as “Hacktoberfest-accepted”</li>
<li>Your pull/merge requests must not be labeled as “invalid”</li>
<li>Avoid submitting low-quality pull/merge requests. More details can be found
<a href="https://hacktoberfest.com/participation/#:~:text=AVOID%20SUBMITTING%20LOW%2DQUALITY%20PULL/MERGE%20REQUESTS." target="_blank" rel="nofollow noopener noreferrer">here</a></li>
</ul>
<p>At Iterative, our mission is to deliver the best developer experience for machine
learning teams by creating an ecosystem of open, modular ML tools. Our tools are
built for developers, by developers, and we need help from the global
open-source community to deliver on this mission!</p>
<p>For all of us who have a heart for open source — let’s discuss, contribute,
learn, take the technologies forward and build something great together!</p>
<p>Happy hacking!</p>
<p align="center">
<img src="https://media.giphy.com/media/LcfBYS8BKhCvK/giphy.gif" alt="animated">
</p>
<hr>
<p>We are happy to hear from you <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a>.
Our <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">DMs on Twitter</a> are always open, too!</p>https://dvc.org/blog/dvc-hydra-integrationhttps://dvc.org/blog/dvc-hydra-integrationTue, 04 Oct 2022 00:00:00 GMT<p><a href="https://hydra.cc/" target="_blank" rel="nofollow noopener noreferrer">Hydra</a> has become one of the most popular tools for managing
the configuration of research projects and complex applications, thanks to its
ability to compose and override configuration both from the command line
and from files.</p>
<p>These features are a great complement to much of the value that DVC
provides:
<a href="https://dvc.org/doc/start/data-management/data-versioning" target="_blank" rel="nofollow noopener noreferrer">data versioning</a>,
<a href="https://dvc.org/doc/start/data-management/data-pipelines" target="_blank" rel="nofollow noopener noreferrer">data pipelines</a>, and
<a href="https://dvc.org/doc/start/experiment-management/experiments" target="_blank" rel="nofollow noopener noreferrer">experiment management</a>.</p>
<p>Therefore, we decided to provide a deeper integration: using Hydra internals
inside DVC so that users can benefit from the best of both
tools.</p>
<p>In this post, we are going to provide an overview of the benefits that users of
both tools can get from the integration.</p>
<h1 id="what-dvc-users-gain-from-the-integration" style="position:relative;">What DVC users gain from the integration<a href="#what-dvc-users-gain-from-the-integration" aria-label="what dvc users gain from the integration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="use-hydra-composition-to-configure-dvc-experiments" style="position:relative;">Use Hydra composition to configure DVC experiments<a href="#use-hydra-composition-to-configure-dvc-experiments" aria-label="use hydra composition to configure dvc experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-10-04/deevee-band-1a6bd99d9764245325f931b7e987907a.mp4" type="video/mp4">
Your browser does not support the video tag. </video></p>
<p>Previously, DVC didn’t provide a way of composing configuration from multiple sources, which
can be very convenient in several use cases, like switching between different
model architectures. The Hydra docs provide a great overview of
<a href="https://hydra.cc/docs/patterns/configuring_experiments/" target="_blank" rel="nofollow noopener noreferrer">common patterns</a> where
this composition is useful.</p>
<p>DVC can now use Hydra Composition to configure entire DVC pipelines and run DVC
experiments.</p>
<p>You can learn more about this feature on the
<a href="https://dvc.org/doc/user-guide/experiment-management/hydra-composition" target="_blank" rel="nofollow noopener noreferrer">Hydra Composition</a>
page of the user guide.</p>
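<p>To make the idea of composition concrete, here is a minimal, standard-library-only
sketch. It is not how Hydra or DVC implement composition; the <code>deep_merge</code> and
<code>compose</code> helpers and the <code>resnet</code>/<code>efficientnet</code> options are
hypothetical names used only to show a base config being combined with one option from
a config group.</p>

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict where values from `override` win over `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical config group "model" with two interchangeable options.
MODEL_GROUP = {
    "resnet": {"model": {"name": "resnet", "layers": 50}},
    "efficientnet": {"model": {"name": "efficientnet", "width": 1.0}},
}

# Hypothetical base config shared by every run.
BASE = {"trainer": {"epochs": 10}, "model": {"dropout": 0.1}}


def compose(base, group, choice):
    """Compose a final config from a base config and one group option."""
    return deep_merge(base, group[choice])


config = compose(BASE, MODEL_GROUP, "efficientnet")
print(config)
```

<p>Switching architectures then means changing a single choice rather than editing
every model-related value by hand.</p>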
<h2 id="appending-and-removing-parameters-on-the-fly" style="position:relative;">Appending and removing parameters on the fly<a href="#appending-and-removing-parameters-on-the-fly" aria-label="appending and removing parameters on the fly permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Previously, DVC supported only limited functionality for modifying configuration using
<code>exp run --set-param</code>.</p>
<p><code>--set-param</code> can now be used with
<a href="https://hydra.cc/docs/advanced/override_grammar/basic/" target="_blank" rel="nofollow noopener noreferrer">Hydra’s Basic Override syntax</a>
supporting new operations like <em>Appending</em> and <em>Removing</em> parameters for
arbitrary parameter files.</p>
<p>When Hydra’s composition is enabled, the same syntax can be used to override
values in the
<a href="https://hydra.cc/docs/tutorials/basic/your_first_app/config_groups/" target="_blank" rel="nofollow noopener noreferrer">Config Groups</a>
and
<a href="https://hydra.cc/docs/tutorials/basic/your_first_app/defaults/" target="_blank" rel="nofollow noopener noreferrer">Defaults list</a>.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token comment"># Append new param</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token string">'+trainer.gradient_clip_val=0.001'</span>
</span><span class="token comment"># Remove existing param</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token string">'~model.dropout'</span>
</span><span class="token comment"># Target arbitrary files</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token string">'train_config.json:+train.weight_decay=0.001'</span>
</span><span class="token comment"># Modify the defaults list</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token string">'train/model=efficientnet'</span></span></code></pre></div>
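<p>The semantics of these overrides can be sketched in a few lines of Python. This is a
deliberately simplified illustration, not Hydra's parser: the <code>apply_override</code>
and <code>parse_value</code> helpers are hypothetical, the sketch ignores the
<code>file:</code> prefix, and it assumes that appending may create missing parent
sections.</p>

```python
import json


def parse_value(raw):
    """Interpret a value as JSON when possible, else keep it as a string."""
    try:
        return json.loads(raw)
    except ValueError:
        return raw


def apply_override(config, override):
    """Apply one simplified override string to a nested dict, in place."""
    if override.startswith("~"):
        mode, path, raw = "remove", override[1:], None
    elif override.startswith("+"):
        mode = "append"
        path, _, raw = override[1:].partition("=")
    else:
        mode = "set"
        path, _, raw = override.partition("=")

    *parents, leaf = path.split(".")
    node = config
    for name in parents:
        # Assumption of this sketch: appending may create missing parents.
        node = node.setdefault(name, {}) if mode == "append" else node[name]

    if mode == "remove":
        del node[leaf]
    elif mode == "append" and leaf in node:
        raise KeyError(f"{path} already exists")
    elif mode == "set" and leaf not in node:
        raise KeyError(f"{path} does not exist; use +{path}=... to append")
    else:
        node[leaf] = parse_value(raw)
    return config


params = {"trainer": {"max_epochs": 5}, "model": {"dropout": 0.2, "name": "cnn"}}
for item in ["+trainer.gradient_clip_val=0.001", "~model.dropout", "model.name=efficientnet"]:
    apply_override(params, item)
print(params)
```

<p>Distinguishing "append" from "set" is what lets a typo in a parameter name fail
loudly instead of silently adding a new key.</p>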
<h2 id="grid-search-of-parameters" style="position:relative;">Grid Search of parameters<a href="#grid-search-of-parameters" aria-label="grid search of parameters permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Previously, DVC <code>exp run</code> only supported
<a href="https://dvc.org/doc/user-guide/experiment-management/running-experiments#the-experiments-queue" target="_blank" rel="nofollow noopener noreferrer">queuing</a>
a single experiment at a time.</p>
<p><code>exp run --set-param</code> can now use Hydra's
<a href="https://hydra.cc/docs/advanced/override_grammar/extended/#choice-sweep" target="_blank" rel="nofollow noopener noreferrer">Choice</a>
and
<a href="https://hydra.cc/docs/advanced/override_grammar/extended/#range-sweep" target="_blank" rel="nofollow noopener noreferrer">Range</a>
syntax for adding multiple experiments to the queue and performing a grid
search:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token string">'model.learning_rate=range(0.01, 0.5, 0.01)'</span> <span class="token parameter variable">--queue</span>
</span>Queueing with "{'params.yaml': ['model.learning_rate=0.01']}".
Queued experiment '84e89be' for future execution.
Queueing with "{'params.yaml': ['model.learning_rate=0.02']}".
Queued experiment 'd7708fa' for future execution.
Queueing with "{'params.yaml': ['model.learning_rate=0.03']}".
Queued experiment '5494d5c' for future execution.
Queueing with "{'params.yaml': ['model.learning_rate=0.04']}".
Queued experiment '2e16c1f' for future execution.
Queueing with "{'params.yaml': ['model.learning_rate=0.05']}".
Queued experiment '7c7a615' for future execution.
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc queue start</span></span></code></pre></div>
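<p>Conceptually, a grid search expands each sweep into a list of values and queues one
experiment per combination. The sketch below shows that expansion with the standard
library only; <code>float_range</code> and <code>expand_grid</code> are hypothetical
helpers, the parameter names are illustrative, and the sketch assumes the stop value
is excluded, as in Python's own <code>range</code>.</p>

```python
import math
from itertools import product


def float_range(start, stop, step):
    """Values from start up to (but excluding) stop, like a range sweep."""
    # The small tolerance guards against floating-point noise in the division.
    count = max(0, math.ceil((stop - start) / step - 1e-9))
    return [round(start + i * step, 10) for i in range(count)]


def expand_grid(sweeps):
    """Turn {param: [values]} into one list of overrides per experiment."""
    names = list(sweeps)
    return [
        [f"{name}={value}" for name, value in zip(names, combo)]
        for combo in product(*(sweeps[name] for name in names))
    ]


sweeps = {
    "model.learning_rate": float_range(0.01, 0.06, 0.01),  # a range sweep
    "model.name": ["resnet", "efficientnet"],  # a choice sweep
}
queue = expand_grid(sweeps)
for overrides in queue:
    print(overrides)
```

<p>Because the grid is a Cartesian product, its size grows multiplicatively with each
swept parameter, which is worth keeping in mind before starting the queue.</p>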
<h1 id="what-hydra-users-gain-from-the-integration" style="position:relative;">What Hydra users gain from the integration<a href="#what-hydra-users-gain-from-the-integration" aria-label="what hydra users gain from the integration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="git-based-versioning-and-caching" style="position:relative;">Git-based versioning and caching<a href="#git-based-versioning-and-caching" aria-label="git based versioning and caching permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Hydra relies on
<a href="https://hydra.cc/docs/configure_hydra/workdir/" target="_blank" rel="nofollow noopener noreferrer">folder-based versioning</a> for
managing multiple runs.</p>
<p>By using the DVC and Hydra integration, you can version the runs using
<a href="https://dvc.org/doc/user-guide/experiment-management" target="_blank" rel="nofollow noopener noreferrer">DVC experiments</a>,
enabling a more
<a href="https://dvc.org/doc/user-guide/experiment-management/persisting-experiments" target="_blank" rel="nofollow noopener noreferrer">git-friendly</a>
workflow and adding
<a href="https://dvc.org/doc/user-guide/experiment-management#run-cache-automatic-log-of-stage-runs" target="_blank" rel="nofollow noopener noreferrer">caching</a>
capabilities so runs won’t be unnecessarily recomputed.</p>
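<p>The caching idea can be pictured as a lookup keyed by everything that influences a
run. The following is a toy sketch under that assumption, not DVC's actual run-cache
implementation; <code>RunCache</code>, the command string, and the dependency hash are
all hypothetical.</p>

```python
import hashlib
import json


class RunCache:
    """Toy cache keyed by everything that can influence a stage's output."""

    def __init__(self):
        self._results = {}

    @staticmethod
    def key(cmd, deps, params):
        """Hash the command, dependency hashes, and parameters together."""
        payload = json.dumps(
            {"cmd": cmd, "deps": deps, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, cmd, deps, params, execute):
        """Call `execute` only if this exact combination is new.

        Returns a (result, was_cached) tuple.
        """
        key = self.key(cmd, deps, params)
        if key in self._results:
            return self._results[key], True
        self._results[key] = execute()
        return self._results[key], False


cache = RunCache()
calls = []


def train():
    calls.append("train")
    return "model.pkl"


deps = {"data/features": "hypothetical-content-hash"}
first = cache.run("python src/train.py", deps, {"train.n_est": 100}, train)
second = cache.run("python src/train.py", deps, {"train.n_est": 100}, train)
third = cache.run("python src/train.py", deps, {"train.n_est": 200}, train)
print(first, second, third, len(calls))
```

<p>Repeating a run with identical inputs returns the stored result, while changing any
parameter produces a new key and triggers a real run.</p>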
<h2 id="multi-step-pipelines-and-language-agnostic" style="position:relative;">Multi-step pipelines and Language Agnostic<a href="#multi-step-pipelines-and-language-agnostic" aria-label="multi step pipelines and language agnostic permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Hydra's scope is limited to a single <strong>Python script</strong> wrapped with the
<code>@hydra.main</code> decorator.</p>
<p>By using the
<a href="https://dvc.org/doc/user-guide/experiment-management/hydra-composition" target="_blank" rel="nofollow noopener noreferrer">DVC and Hydra integration</a>,
you can use Hydra to configure entire
<a href="https://dvc.org/doc/start/data-management/data-pipelines" target="_blank" rel="nofollow noopener noreferrer">DVC pipelines</a>, which
can be composed of <strong>multiple</strong> <strong>stages</strong> running <strong>arbitrary</strong> <strong>commands.</strong></p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
<span class="token key atrule">featurize</span><span class="token punctuation">:</span>
<span class="token key atrule">cmd</span><span class="token punctuation">:</span> python src/featurization.py data/prepared data/features
<span class="token key atrule">deps</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> data/prepared
<span class="token punctuation">-</span> src/featurization.py
<span class="token key atrule">params</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> featurize.max_features
<span class="token punctuation">-</span> featurize.ngrams
<span class="token key atrule">outs</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> data/features
<span class="token key atrule">train</span><span class="token punctuation">:</span>
<span class="token key atrule">cmd</span><span class="token punctuation">:</span> python src/train.py data/features model.pkl
<span class="token key atrule">deps</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> data/features
<span class="token punctuation">-</span> src/train.py
<span class="token key atrule">params</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> train.min_split
<span class="token punctuation">-</span> train.n_est
<span class="token key atrule">outs</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> model.pkl</code></pre></div>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token string">'featurize.max_features=200'</span> <span class="token parameter variable">-S</span> <span class="token string">'train.n_est=100'</span>
</span>Running stage 'featurize':
> python src/featurization.py data/prepared data/features
Running stage 'train':
> python src/train.py data/features model.pkl</code></pre></div>
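<p>The order in which the stages run follows from their dependencies and outputs:
<code>train</code> consumes <code>data/features</code>, which <code>featurize</code>
produces. A minimal sketch of that ordering, assuming Python 3.9+ for
<code>graphlib</code> and using a hypothetical <code>execution_order</code> helper (this
is not DVC's internal code):</p>

```python
from graphlib import TopologicalSorter

# The two stages from the pipeline above, reduced to deps and outs.
stages = {
    "featurize": {
        "deps": ["data/prepared", "src/featurization.py"],
        "outs": ["data/features"],
    },
    "train": {
        "deps": ["data/features", "src/train.py"],
        "outs": ["model.pkl"],
    },
}


def execution_order(stages):
    """Order stages so that each runs after the stages producing its deps."""
    producer = {
        out: name for name, stage in stages.items() for out in stage["outs"]
    }
    graph = {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)
```

<p>Dependencies that no stage produces, such as source files, simply add no edges to
the graph.</p>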
<h1 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Starting with DVC <code>2.25.0</code>, you can use the features described in this post to
efficiently combine Hydra and DVC in your projects.</p>
<p>To get a deeper understanding of all the parts involved, you can read the
<a href="https://dvc.org/doc/user-guide/experiment-management/hydra-composition" target="_blank" rel="nofollow noopener noreferrer">Hydra Composition</a>
page of the DVC user guide.</p>https://dvc.org/blog/september-22-heartbeathttps://dvc.org/blog/september-22-heartbeatMon, 19 Sep 2022 00:00:00 GMT<details>
<p>This month’s image inspiration is community member
<a href="https://www.linkedin.com/in/sami-jawhar-a58b9849/" target="_blank" rel="nofollow noopener noreferrer"><strong>Sami Jawhar</strong></a>. Sami has
contributed to DVC in the past and most recently to the DVC and CML teams with
regard to extending our remote experimenting features to include running
experiments in parallel, which you can check out
<a href="https://github.com/iterative/dvc/commit/c7d63e8c59819592d2a749ab721fe5c85379fece" target="_blank" rel="nofollow noopener noreferrer">here</a>
and
<a href="https://github.com/iterative/terraform-provider-iterative/compare/master...sjawhar:terraform-provider-iterative:feature/nfs-volume" target="_blank" rel="nofollow noopener noreferrer">here</a>.
Look out for him speaking at a Meetup soon on this topic!</p>
<p>Last year Sami presented at one of our
<a href="https://www.youtube.com/watch?v=DxZdWq3Weng" target="_blank" rel="nofollow noopener noreferrer">Office Hours meetups</a> on “What is
an experiment?” More specifically he asked, at what level of granularity do you
experiment and when do you share with your team? He shared great ideas, tips,
and code in the session and spurred a great discussion with other community
members. We look forward to the next Meetup!</p>
<summary>✨Image Inspo✨</summary>
</details>
<details>
Our Community has grown and so has the monthly Heartbeat! To help you better navigate to the content you desire, use the following ToC:
<h1 id="table-of-contents" style="position:relative;">Table of contents<a href="#table-of-contents" aria-label="table of contents permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<ol>
<li><a href="#from-greater-aiml-community">From Greater AI/ML Community</a>
<ol>
<li><a href="#meta-is-building-an-ai-to-fact-check-wikipediaall-65-million-articles">Meta Is Building an AI to Fact-Check Wikipedia</a></li>
<li><a href="#european-ai-act">EU AI Act</a></li>
<li><a href="#pulse-check">💗Pulse Check</a></li>
</ol>
</li>
<li><a href="#iterative-community-news">Iterative Community News</a>
<ol>
<li><a href="#francesco-calcavecchia---we-refused-to-use-a-hammer-on-a-screw-story-of-gto-based-model-registry">Story of GTO-based model registry</a></li>
<li><a href="#mlops-course-at-the-technical-university-of-denmark-includes-dvc-and-cml">MLOps Course at University of Denmark</a></li>
<li><a href="#goku-mohandas---made-with-ml-mlops-interactive-course">Made With ML MLOps Interactive Course</a></li>
<li><a href="#adri%C3%A0-romero---youtube-review-of-dvc">Lakera Review of DVC (video)</a></li>
<li><a href="#sydney-firmin---reproducibility-replicability-and-data-science">Reproducibility, Replicability, and Data Science</a></li>
<li><a href="#iterative-xkcd-lore">Iterative xkcd lore</a></li>
</ol>
</li>
<li><a href="#company-news">Company News</a>
<ol>
<li><a href="#mlem-mlem-mlem-this-dog-food-is-good">We are eating our own dog food</a></li>
<li><a href="#alex-kim-oreilley-mlops-course">New O'Reilly Course with Alex Kim</a></li>
<li><a href="#latam-ai">LATAM AI</a></li>
<li><a href="#new-hires">New Hires</a></li>
<li><a href="#open-positions">Open Positions</a></li>
<li><a href="#new-blog-posts">New Blog posts</a></li>
<li><a href="#upcoming-conferences">Upcoming Conferences</a></li>
</ol>
</li>
<li><a href="#tweet-love">Tweet Love</a></li>
</ol>
<summary>Table of Contents</summary>
</details>
<p>As the summer fades and we get revved up to finish off the year, we start the
September Heartbeat with some juicy food for thought AI topics.</p>
<p><img src="https://media.giphy.com/media/kPtv3UIPrv36cjxqLs/giphy.gif" alt="Will Ferrell Lol GIF by NBA"></p>
<h2 id="from-greater-aiml-community" style="position:relative;">From Greater AI/ML Community<a href="#from-greater-aiml-community" aria-label="from greater aiml community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="meta-is-building-an-ai-to-fact-check-wikipediaall-65-million-articles" style="position:relative;">Meta Is Building an AI to Fact-Check Wikipedia—All 6.5 Million Articles<a href="#meta-is-building-an-ai-to-fact-check-wikipediaall-65-million-articles" aria-label="meta is building an ai to fact check wikipediaall 65 million articles permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c2295358e28c8014e45ea3b24ca41ee9/ab158/wikipedia.png" alt="Meta Fact-Checking Wikipedia" title="Meta Fact-checking Wikipedia Ai" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<a href="http://twitter.com/vanessabramirez" target="_blank" rel="nofollow noopener noreferrer"><strong>Vanessa Bates Ramirez</strong></a> writes
<a href="https://singularityhub.com/2022/08/26/meta-is-building-an-ai-to-fact-check-wikipedia-all-6-5-million-articles/" target="_blank" rel="nofollow noopener noreferrer">an article</a>
in <a href="https://singularityhub.com" target="_blank" rel="nofollow noopener noreferrer">Singularity Hub</a> about Meta's plans to
fact-check Wikipedia. Under the premise of making Wikipedia more accurate,
<a href="https://about.facebook.com/?utm_source=meta.com&utm_medium=redirect" target="_blank" rel="nofollow noopener noreferrer">Meta</a>, in
conjunction with <a href="https://www.amazon.science/tag/alexa" target="_blank" rel="nofollow noopener noreferrer">Amazon Alexa.AI</a> and
<a href="https://openreview.net/pdf?id=qfTqRtkDbWZ" target="_blank" rel="nofollow noopener noreferrer">some university contributors</a>, is
building an AI system trained on 4 million Wikipedia citations. The system
architecture, made up of retrieval and verification engines, cross-references
not only content but also specific figures to verify accuracy.</p>
<p>They’ve built an index of web pages chunked into passages, and they
provide an accurate representation of each passage to train the model. Their aim
is to capture meaning rather than word patterns. From
<a href="https://twitter.com/Fabio_Petroni" target="_blank" rel="nofollow noopener noreferrer"><strong>Fabio Petroni</strong></a>, Meta’s Fundamental AI
Research tech lead manager:</p>
<blockquote>
<p>[This index] is not representing word-by-word the passage, but the meaning of
the passage. That means that two chunks of text with similar meaning will be
represented in a very close position in the resulting n-dimensional space
where all these passages are stored.</p>
</blockquote>
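<p>As a rough illustration of "close position in n-dimensional space", the sketch
below ranks passages by cosine similarity to a query vector. The three-dimensional
vectors and passage texts are invented for the example and are not output from Meta's
system, where such vectors are learned by a model.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query, passages):
    """Return passage texts sorted from most to least similar to the query."""
    return sorted(
        passages,
        key=lambda text: cosine_similarity(query, passages[text]),
        reverse=True,
    )


# Invented toy embeddings: the first two passages share a meaning,
# so their vectors point in nearly the same direction.
passages = {
    "The tower is 330 metres tall.": [0.9, 0.1, 0.2],
    "The structure rises to about 330 m.": [0.85, 0.15, 0.25],
    "Bananas are rich in potassium.": [0.1, 0.9, 0.3],
}
query = [0.88, 0.12, 0.2]  # toy vector for a claim about the tower's height

ranked = rank(query, passages)
print(ranked)
```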
<p>They hope to ultimately be able to suggest accurate sources and create a grading
system on accuracy.
<a href="https://verifier.sideeditor.com/" target="_blank" rel="nofollow noopener noreferrer">You can find a demo of the project, named Side, here</a>
to look at samples and go deeper into the research. They are looking for people
to give feedback on the quality of the system.</p>
<p>Vanessa brings up some great questions regarding this:</p>
<blockquote>
<p>If you imagine a not-too-distant future where everything you read on Wikipedia
is accurate and reliable, wouldn’t that make doing any sort of research a bit
too easy? There’s something valuable about checking and comparing various
sources ourselves, is there not? It was a big leap to go from paging through
heavy books to typing a few words into a search engine and hitting “Enter”; do
we really want Wikipedia to move from a research jumping-off point to a
gets-the-last-word source?</p>
</blockquote>
<p>To these I’d add: what’s Meta’s and Amazon Alexa's monetary motivation to do this
(because there always is one), and given past ethical infractions on Meta's part
(<a href="https://link.springer.com/article/10.1007/s43681-021-00068-x" target="_blank" rel="nofollow noopener noreferrer">1</a>,
<a href="https://www.abc.net.au/triplej/programs/hack/facebook-whistleblower-says-instagram-content-hurts-teens/13573020" target="_blank" rel="nofollow noopener noreferrer">2</a>,
<a href="https://www.theguardian.com/news/2018/mar/17/cambridge-analytica-facebook-influence-us-election" target="_blank" rel="nofollow noopener noreferrer">3</a>,
<a href="https://www.buzzfeednews.com/article/craigsilverman/viral-fake-election-news-outperformed-real-news-on-facebook" target="_blank" rel="nofollow noopener noreferrer">4</a>,
and
<a href="https://www.theatlantic.com/technology/archive/2014/06/everything-we-know-about-facebooks-secret-mood-manipulation-experiment/373648/" target="_blank" rel="nofollow noopener noreferrer">5</a>),
should we applaud this? Or is this collaboration with universities a step in the
right direction?</p>
<h3 id="european-ai-act" style="position:relative;">European AI Act<a href="#european-ai-act" aria-label="european ai act permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><span class="gatsby-resp-image-wrapper image-wrap-right" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d9923667ce38b6b7e88c76dda707f8d2/bbe0c/eu.jpg" alt="EU AI Act" title="EU AI Act t" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<a href="https://twitter.com/Kyle_L_Wiggers" target="_blank" rel="nofollow noopener noreferrer"><strong>Kyle Wiggers</strong></a> reports on the EU's AI
Act and its potential ill effects on open source efforts in
<a href="https://techcrunch.com/2022/09/06/the-eus-ai-act-could-have-a-chilling-effect-on-open-source-efforts-experts-warn/" target="_blank" rel="nofollow noopener noreferrer">this piece</a>
in <a href="https://techcrunch.com" target="_blank" rel="nofollow noopener noreferrer">TechCrunch</a>. The proposed new rules would require
that open source developers adhere to guidelines across a spectrum of categories
including risk management, data governance, technical documentation and
transparency, standards and accuracy, and cyber security. Not a negligible list.</p>
<p>The article covers critiques of the Act from
<a href="https://www.brookings.edu/experts/alex-engler/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Engler</strong></a> of think tank
<a href="https://brookings.edu" target="_blank" rel="nofollow noopener noreferrer">Brookings</a> through
<a href="https://www.brookings.edu/blog/techtank/2022/08/24/the-eus-attempt-to-regulate-open-source-ai-is-counterproductive/" target="_blank" rel="nofollow noopener noreferrer">this piece.</a>
Meanwhile, <a href="https://twitter.com/etzioni" target="_blank" rel="nofollow noopener noreferrer"><strong>Oren Etzioni</strong></a>, the founding CEO of the
<a href="https://allenai.org/" target="_blank" rel="nofollow noopener noreferrer">Allen Institute for AI</a>, adds that such regulation could
create an undue burden where only large tech companies could comply:</p>
<blockquote>
<p>“Open source developers should not be subject to the same burden as those
developing commercial software. It should always be the case that free
software can be provided ‘as is’ — consider the case of a single student
developing an AI capability; they cannot afford to comply with EU regulations
and may be forced not to distribute their software, thereby having a chilling
effect on academic progress and on reproducibility of scientific results.”</p>
</blockquote>
<p>The article discusses some proponents of the Act, as well as alternative views
on the granularity of regulations (product vs. category, or downstream
responsibility). Finally, it ends with comments from Hugging Face CEO
<a href="https://twitter.com/ClementDelangue" target="_blank" rel="nofollow noopener noreferrer"><strong>Clément Delangue</strong></a> and his colleagues
on the Act's vagueness and the problems that can arise from this lack of
clarity, including stifled competition and innovation. They also point to
growing Responsible AI initiatives, such as AI licensing and model cards
outlining the intended use of open source technology, as community-born
positives.</p>
<p>So does regulation stifle technology or provide guard rails?</p>
<p>My colleague <a href="https://www.linkedin.com/in/rcdewit/" target="_blank" rel="nofollow noopener noreferrer"><strong>Rob de Wit</strong></a> would like
to point out that similar concerns were raised when the EU introduced the GDPR
in 2016, which has turned out to be of major importance to people's rights to
privacy — in the EU and worldwide.</p>
<p>To what degree should AI technology be regulated? Where do you draw lines? It’s
quite clear that it moves faster than lawmakers can keep up with and the
potential for harm is well known at this point. We could say, as I believe, that
reflection on the consequences should be baked into the building process.
However, the reality in practice is that, despite best intentions, the
overarching push for better and faster often results in negative consequences
that are only discovered after the fact.</p>
<p>How do we incentivize reflecting on consequences in our processes? Would
regulation force this? Would it make development slower, but necessarily force the
social good work that must be done in the development of AI tech?</p>
<p>What other industries have similar dilemmas and how do they handle them? The
Hippocratic Oath has served medicine well for thousands of years.<br>
<a href="https://ojs.aaai.org/index.php/aimagazine/article/view/15090" target="_blank" rel="nofollow noopener noreferrer">Do We Need a Hippocratic Oath for Artificial Intelligence Scientists?</a></p>
<h3 id="pulse-check" style="position:relative;">Pulse Check<a href="#pulse-check" aria-label="pulse check permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We would love to hear (read) your thoughts on this! Each month, we will put a
“Pulse check” topic from the Heartbeat up for discussion in the General channel
of our Discord server.
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Come join the discussion!</a></p>
<p><img src="https://media.giphy.com/media/W5JywCYOCSP8VMiVZg/giphy.gif" alt="Heartbeat GIF"></p>
<h2 id="iterative-community-news" style="position:relative;">Iterative Community News<a href="#iterative-community-news" aria-label="iterative community news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="francesco-calcavecchia---we-refused-to-use-a-hammer-on-a-screw-story-of-gto-based-model-registry" style="position:relative;"><strong>Francesco Calcavecchia</strong> - We refused to use a hammer on a screw: Story of GTO-based model registry<a href="#francesco-calcavecchia---we-refused-to-use-a-hammer-on-a-screw-story-of-gto-based-model-registry" aria-label="francesco calcavecchia we refused to use a hammer on a screw story of gto based model registry permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://www.linkedin.com/in/francescocalcavecchia/" target="_blank" rel="nofollow noopener noreferrer"><strong>Francesco Calcavecchia</strong></a>
<a href="https://medium.com/@francesco.calcavecchia/we-refused-to-use-a-hammer-on-a-screw-story-of-a-gto-based-model-registry-c540ac5d129f" target="_blank" rel="nofollow noopener noreferrer">wrote a piece</a>
on <a href="https://medium.com" target="_blank" rel="nofollow noopener noreferrer">Medium</a> about building a custom model registry with
<a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a>.</p>
<p>He acknowledges the main reasons for needing a model registry as:</p>
<ol>
<li>When you need model versioning</li>
<li>When you need to promote or assign models to different stages</li>
<li>When you need to establish production model governance</li>
</ol>
<p>Additionally, he finds it necessary to register data analysis and model
evaluation outputs in an artifact registry, and used GTO and DVC to accomplish
this. He goes into more detail about why he chose GTO over MLflow, essentially
appreciating our UNIX philosophy, which favors agility over prescriptive methods
that hamper your design choices. He notes:</p>
<blockquote>
<p><strong>It is hard to think of something simpler than this. And simplicity is
beauty</strong> ❤️</p>
</blockquote>
<p>He then discusses some things he found missing for his needs, such as using it
in a production pipeline as opposed to committing models by hand. He discusses
working on solutions to build the artifact registry, introduce new commands, and
streamline the process for the <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> remote storage secret requirements.
Please join him in his contributions. We love to see where this is going! 🚀</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/07b6eea95e4eb33178c818d1a1e0578a/03346/artifact-gto.jpg" alt="DVC GTO Artifact Registry schematic" title="DVC GTO Artifact Registry schematic" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Francesco Calcavecchia's schematic for a proposed artifact registry with DVC
and GTO
(<a href="https://medium.com/@francesco.calcavecchia/we-refused-to-use-a-hammer-on-a-screw-story-of-a-gto-based-model-registry-c540ac5d129f" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h3 id="mlops-course-at-the-technical-university-of-denmark-includes-dvc-and-cml" style="position:relative;">MLOps Course at the Technical University of Denmark includes DVC and CML<a href="#mlops-course-at-the-technical-university-of-denmark-includes-dvc-and-cml" aria-label="mlops course at the technical university of denmark includes dvc and cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/446b7dcde514e3a7faeec88a200ee44e/03346/dtu-mlops.jpg" alt="DTU MLOps Course Memes" title="DTU MLOps Course Meme" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
The <a href="https://www.dtu.dk/english" target="_blank" rel="nofollow noopener noreferrer">Technical University of Denmark (DTU)</a> has
included DVC and CML in its MLOps course. The lectures,
slides, exercises, and code can be found in
<a href="https://github.com/SkafteNicki/dtu_mlops" target="_blank" rel="nofollow noopener noreferrer">this repo</a> from
<a href="https://github.com/SkafteNicki" target="_blank" rel="nofollow noopener noreferrer"><strong>Nicki Skafte Detlefsen</strong></a>, Postdoc in the
section of Cognitive Systems at the University with a focus on generative models
and geometrical deep learning. There are 10 sections covering:</p>
<ol>
<li>Getting started</li>
<li>Organization and version control (find Git and DVC here)</li>
<li>Reproducibility</li>
<li>Debugging and logging</li>
<li>Continuous X (find CML here)</li>
<li>The Cloud</li>
<li>Scalable applications</li>
<li>Deployment</li>
<li>Monitoring</li>
<li>Extra Resources</li>
</ol>
<p>The materials are great and even include some funny memes. Isn't an open-source
model amazing for learning? Cheers to DTU for including our tools and for openly
sharing these learning materials with the world!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 640px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/61262120d4bbd20269bd3e99788d3d50/bbe0c/dtu-bad-code.jpg" alt="DTU bad code comic" title="DTU bad code comic" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Good code review vs. Bad code review
(<a href="https://github.com/SkafteNicki/dtu_mlops/blob/main/s2_organisation_and_version_control/S2.md" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h3 id="goku-mohandas---made-with-ml-mlops-interactive-course" style="position:relative;"><strong>Goku Mohandas</strong> - Made With ML MLOps Interactive Course<a href="#goku-mohandas---made-with-ml-mlops-interactive-course" aria-label="goku mohandas made with ml mlops interactive course permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You likely already know of <a href="https://github.com/GokuMohandas" target="_blank" rel="nofollow noopener noreferrer"><strong>Goku Mohandas'</strong></a>
wildly popular free course <a href="https://madewithml.com/#mlops" target="_blank" rel="nofollow noopener noreferrer">Made with ML</a>, which
includes DVC. Knowing that it can be challenging to learn everything on your
own, he is starting an interactive class beginning on October 1st. The deadline
for application is September 25th.<br>
<a href="https://madewithml.com/#interactive-course" target="_blank" rel="nofollow noopener noreferrer">For more info find the details here.</a></p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/149517f06f19a9dd9fff3925347b556a/39600/made-with-ml.png" alt="Goku Mohandas - Made with ML MLOps" title="Goku Mohandas - Made with ML MLOps" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Goku Mohandas' Made with ML Interactive Course
(<a href="https://madewithml.com/#mlops" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h3 id="adrià-romero---youtube-review-of-dvc" style="position:relative;"><strong>Adrià Romero</strong> - YouTube review of DVC<a href="#adri%C3%A0-romero---youtube-review-of-dvc" aria-label="adrià romero youtube review of dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://www.linkedin.com/in/adriaromero/" target="_blank" rel="nofollow noopener noreferrer"><strong>Adrià Romero</strong></a>, Computer Vision
Developer at <a href="https://www.lakera.ai/" target="_blank" rel="nofollow noopener noreferrer">Lakera</a>, regularly reviews
tools that can make computer vision easier, and recently reviewed DVC. He does a
demo of DVC pushing up to a Google Drive remote and goes over how to share
DVC-tracked data. He then covers the data pipelines functionality that can be
used for CI/CD pipelines and shows the benefits of tracking the versions of
everything including data, models, pipelines, parameters, and experiments.
Finally, he mentions that our documentation is super clear and useful, which
makes us very happy. 🦉 Check out the review below.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/DXlxr4sEnc0?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h3 id="sydney-firmin---reproducibility-replicability-and-data-science" style="position:relative;"><strong>Sydney Firmin</strong> - Reproducibility, Replicability, and Data Science<a href="#sydney-firmin---reproducibility-replicability-and-data-science" aria-label="sydney firmin reproducibility replicability and data science permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><span class="gatsby-resp-image-wrapper image-wrap-right" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/de2134806780269b20cad03f670f6247/8a54c/the_difference.png" alt="Sydney Firmin - Reproducibility, Replicability, and Data Science" title="Sydney Firmin - Reproducibility, Replicability, and Data Science" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p><a href="https://www.linkedin.com/in/sydney-f-4369a65b/" target="_blank" rel="nofollow noopener noreferrer"><strong>Sydney Firmin</strong></a> writes
<a href="https://www.kdnuggets.com/2019/11/reproducibility-replicability-data-science.html" target="_blank" rel="nofollow noopener noreferrer">a wonderful piece</a>
in KDnuggets outlining the replicability crisis and the importance of
reproducibility in science in general and data science in particular. She
highlights the growing awareness of irreproducible research now that technology
helps all research circulate more widely. She encourages standardizing a
paradigm of reproducibility in data science work to promote efficiency and
accuracy, and to help your future self and colleagues check work and reduce
bugs.</p>
<p>Of course, she recommends DVC as a possible tool to help with this and notes,</p>
<blockquote>
<p>fun fact, this is my second attempt at writing this post after my computer was
<a href="https://en.wikipedia.org/wiki/Brick_(electronics)" target="_blank" rel="nofollow noopener noreferrer">bricked</a> last week. I am
now compulsively saving all of my work
in <a href="https://www.vox.com/2015/4/30/11562024/too-embarrassed-to-ask-what-is-the-cloud-and-how-does-it-work" target="_blank" rel="nofollow noopener noreferrer">the cloud</a>.</p>
</blockquote>
<p>Haven’t we all been there? 🙋🏻♀️ She goes on to describe other contributors to
irreproducible results, including p-hacking, and discusses methods beyond
tooling that can help, such as preventing overfitting, using a sufficiently
large dataset, and team review. All this and some fun xkcd comics
can be found in the post, including <a href="https://xkcd.com/242" target="_blank" rel="nofollow noopener noreferrer">this one shown above</a>!</p>
<details>
<p>Speaking of xkcd comics,
<a href="https://github.com/casperdcl" target="_blank" rel="nofollow noopener noreferrer"><strong>Casper da Costa Luis</strong></a>, CML Product Manager,
loves xkcd and regularly regales us with the comics in our internal Slack. He is
also an expert at TL;DRing (yes, I just made that a verb). Part of his process
in this excellence is to
“<a href="https://tldr.cdcl.ml" target="_blank" rel="nofollow noopener noreferrer">suppress my latent desire to add a relevant xkcd comic</a>.”
As you can see, they do not appear every day. Self-discipline is a good thing.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/fc1c3a4e4b20260e7d3f261f11d650c0/39600/casper-xkcd.png" alt="Casper da Costa Luis and xkcd comics" title="Casper da Costa Luis and xkcd comics" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Casper da Costa Luis' propensity for Slack slinging xkcd comics</em></p>
<summary id="iterative-xkcd-lore">😄 Iterative xkcd Lore</summary>
</details>
<h2 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><img src="https://media.giphy.com/media/ji6BdEco3I29DTXddx/giphy-downsized-large.gif" alt="Happy Dog Food GIF by Diamond Pet Foods"></p>
<h3 id="mlem-mlem-mlem-this-dog-food-is-good" style="position:relative;">MLEM, MLEM, MLEM, this dog food is good!<a href="#mlem-mlem-mlem-this-dog-food-is-good" aria-label="mlem mlem mlem this dog food is good permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>So over the summer, you may have noticed that our blog has moved from the
<a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a> website to the <a href="https://iterative.ai" target="_blank" rel="nofollow noopener noreferrer">Iterative</a> website.
This is because we now have many more tools than DVC, and we wanted to make a
blog home for them all. In this transition, we have also changed our internal
blog writing process from being just Git-dependent to Git- and DVC-dependent,
such that the writing is in Git, but the images are versioned with DVC and
stored in a remote. 🤗</p>
<p>This admittedly may be like bringing a
<a href="https://arclightcnc.com/product/cnc-router-kit" target="_blank" rel="nofollow noopener noreferrer">CNC router</a> to a steak dinner
(I feel like there should be a MythBusters episode on this). <strong>But</strong> it will
help both the DevRel team and the Websites team become intimately familiar with
what our users feel when using our tools and potentially drive more feature
improvements for you. In other words, we ❤️ you and we're really serious about
making our tools better for you so you don't have to build them yourselves!</p>
<p><img src="https://media.giphy.com/media/wdA6Ql7ku32JZKXBFV/giphy.gif" alt="Ken Jeong Masked Singer GIF by FOX TV"></p>
<h3 id="alex-kim-oreilly-mlops-course" style="position:relative;"><strong>Alex Kim</strong> O'Reilly MLOps Course<a href="#alex-kim-oreilly-mlops-course" aria-label="alex kim oreilly mlops course permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ed1f0f9c5183f7978b8ff0cec495e436/39600/alex-oreilly.png" alt="Open-source MLOps in 4 weeks with Alex Kim" title="Open Source MLOps in 4 weeks" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<a href="https://twitter.com/alex000kim" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim</strong></a> is working with
<a href="https://www.oreilly.com/" target="_blank" rel="nofollow noopener noreferrer">O'Reilly</a> on a course entitled <em>Open-source MLOps in
4 weeks</em>. Here is an outline of what you will be learning in the course which
starts on November 8th and again on January 10th:</p>
<ul>
<li>Week 1: Kick-starting an ML project</li>
<li>Week 2: ML pipelines and reproducibility</li>
<li>Week 3: Serving ML models as web API services</li>
<li>Week 4: CI/CD and monitoring for ML projects</li>
</ul>
<p><a href="https://learning.oreilly.com/live-events/open-source-mlops-in-4-weeks/0636920080215/0636920080214/" target="_blank" rel="nofollow noopener noreferrer">Head here to sign up for the course</a></p>
<h3 id="latam-ai" style="position:relative;">LATAM AI<a href="#latam-ai" aria-label="latam ai permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://twitter.com/SoyGema" target="_blank" rel="nofollow noopener noreferrer"><strong>Gema Parreño Piqueras</strong></a> and our lead docs
writer, <a href="https://twitter.com/JorgeOrpinel" target="_blank" rel="nofollow noopener noreferrer"><strong>Jorge Orpinel Perez</strong></a>, got to
experience <a href="https://www.latam-ai.com/" target="_blank" rel="nofollow noopener noreferrer">LATAM AI</a> this year. Gema gave the talk
<em>Reproducibility and version control are important: Follow-up experiments with
the DVC extension for VS Code</em>. Both Gema and Jorge enjoyed the conference and
meeting lots of people. Below you can see Gema with the winners of our DeeVee's
Ramen Run Game. In the game, players have to roam DeeVee city answering
questions to win Ramen and the highest place on the leaderboard. Get yourself to
one of the conferences we are attending to play! See winners Miguel Moran
Flores, Efren Bautista Linares, and Rodolfo Ferro below.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c4a5039960064b937e6b3c52018a78be/03346/latam-ai-winners.jpg" alt="Efren Bautista Linares, Miguel Moran Flores, Rodolfo Ferro with Gema Parreño" title="Efren Bautista Linares, Miguel Moran Flores, Rodolfo Ferro with Gema Parreño" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Winners of DeeVee's Ramen Run game with Gema, Left to Right: Efren Bautista
Linares, Miguel Moran Flores, Gema Parreño Piqueras, and Rodolfo Ferro</em></p>
<h3 id="new-hires" style="position:relative;">New Hires<a href="#new-hires" aria-label="new hires permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://www.linkedin.com/in/ronan-lamy-84133612/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ronan Lamy</strong></a> joins the DVC
team from Bristol, UK. He has a Ph.D. in physics and had been working as an
open-source contractor as core dev of PyPy and HPy before joining Iterative.
When he's not working, Ronan enjoys exploring the many fine restaurants and great
local beers of Bristol. Originally from France, Ronan recently shared with me
that his friends and family back home don't believe that the food can be so good
in Bristol, but he insists it is. Add it to your bucket list! When in Bristol,
Ronan has recommendations for you!</p>
<p><a href="https://github.com/nimdraugsael" target="_blank" rel="nofollow noopener noreferrer"><strong>Aleksei Shaikhaleev</strong></a> joins the Studio team
as a backend developer. Originally from Russia, Aleksei has called Phuket,
Thailand his home base for the last 10 years. When he's not working, he's really
into surfing, skateboarding, motorcycles, and other fun activities like these.
Aleksei also has a heart for rescuing cats, having adopted and now caring for five
stray cats at home!</p>
<p><a href="https://www.linkedin.com/in/david-tulga-60b29410/" target="_blank" rel="nofollow noopener noreferrer"><strong>David Tulga</strong></a> is our
latest hire, and joins the LDB team from California as a Senior Software
Engineer. He previously worked at Asimov and Freenome. When not working, David
enjoys a variety of outdoor activities such as biking, hiking, kayaking,
sailing, and astronomy.</p>
<p>David's arrival marks the 4th David on the team, putting the name David in a
three-way tie with versions of Daniel and Alexander! Indeed over 20% of our
workforce is named David, Daniel, or Alexander. 😅</p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a>
to find details of all the open positions. Please share with anyone looking to
have a lot of fun building the next generation of machine-learning-to-production
tools! 🚀 But don't apply if your name is David, Daniel, or Alexander. Unless
you're willing to be nicknamed, of course! It's getting confusing around here.
😂</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/14a8b26dd92c5a19428d6a7bef2078f0/03346/hiring.jpg" alt="Iterative.ai is Hiring" title="Iterative.ai is Hiring" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Iterative is Hiring
(<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="new-blog-posts" style="position:relative;">✍🏼 New Blog posts<a href="#new-blog-posts" aria-label="new blog posts permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ul>
<li><a href="https://www.linkedin.com/in/rcdewit/" target="_blank" rel="nofollow noopener noreferrer"><strong>Rob de Wit</strong></a> created a tutorial for
using CML with <a href="https://bitbucket.org/" target="_blank" rel="nofollow noopener noreferrer">Bitbucket</a>, which CML now supports. Be
sure to read it if Bitbucket is your Git provider of choice!</li>
<li><a href="https://twitter.com/SoyGema" target="_blank" rel="nofollow noopener noreferrer"><strong>Gema Parreño Piqueras'</strong></a>
<a href="https://dvc.org/blog/august-22-community-gems" target="_blank" rel="nofollow noopener noreferrer">August Community Gems</a> is full
of great questions from the Community from our
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord server</a>.</li>
</ul>
<h2 id="upcoming-conferences" style="position:relative;">Upcoming Conferences<a href="#upcoming-conferences" aria-label="upcoming conferences permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Conferences we will be attending through the end of the year:</p>
<ul>
<li>
<p><a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a> and
<a href="https://github.com/mike0sv" target="_blank" rel="nofollow noopener noreferrer"><strong>Mike Sveshnikov</strong></a> will be giving a talk and
workshop on our GitOps approach to a Model registry at
<a href="https://twimlai.com/conf/twimlcon/2022/" target="_blank" rel="nofollow noopener noreferrer">TWIML Con</a> on October 4-7 (online)</p>
</li>
<li>
<p><a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a> will speak at
<a href="https://odsc.com/california/" target="_blank" rel="nofollow noopener noreferrer">ODSC West</a> in San Francisco on November 1-3 on
the same topic</p>
</li>
<li>
<p><a href="https://www.linkedin.com/in/rcdewit/" target="_blank" rel="nofollow noopener noreferrer"><strong>Rob de Wit</strong></a> will be speaking at
<a href="https://deeplearningworld.de/" target="_blank" rel="nofollow noopener noreferrer">Deep Learning World</a> - Berlin, October 5-6
with the talk <em>Becoming a Pokémon Master with DVC: Experiment Pipelines for
Deep Learning Projects</em></p>
</li>
<li>
<p><a href="https://cdcl.ml/" target="_blank" rel="nofollow noopener noreferrer"><strong>Casper da Costa Luis</strong></a> will be giving the talk <em>Painless
cloud orchestration without leaving your IDE</em> at
<a href="https://www.re-work.co/events/mlops-summit-2022" target="_blank" rel="nofollow noopener noreferrer">MLOps Summit - Re-work</a> -
London, November 8-9</p>
</li>
<li>
<p><a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a> will be speaking at
<a href="https://www.githubuniverse.com/" target="_blank" rel="nofollow noopener noreferrer">GitHub Universe</a> on November 9-10 with the
talk <em>Connecting Machine Learning with Git: ML experiment tracking with
Codespaces</em>!</p>
</li>
<li>
<p>Finally, we will be participating in
<a href="https://www.torontomachinelearning.com/" target="_blank" rel="nofollow noopener noreferrer">Toronto Machine Learning Summit</a> -
November 29-30 in Toronto, talks TBD</p>
</li>
</ul>
<h2 id="tweet-love" style="position:relative;">❤️ Tweet Love<a href="#tweet-love" aria-label="tweet love permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We loved finding DVC and CML used for benchmarking and reporting at
<a href="https://huggingface.co" target="_blank" rel="nofollow noopener noreferrer">Hugging Face</a> thanks to the tip-off from
<a href="https://twitter.com/osanseviero" target="_blank" rel="nofollow noopener noreferrer">Omar Sanseviero</a>! Look out for more projects
involving Hugging Face and our tools coming soon!</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr"><a href="https://twitter.com/huggingface">@huggingface</a> datasets uses <a href="https://twitter.com/DVCorg">@DVCorg</a> & CML for benchmark and reporting 🥰 . More about the .yaml structure here --> <a href="https://t.co/NY5FMzjNuR">https://t.co/NY5FMzjNuR</a> Glad to discover common opensourceness <a href="https://twitter.com/osanseviero">@osanseviero</a> ! 🤗🥹🦉</p>— Gema Parreño (@SoyGema) <a href="https://twitter.com/SoyGema/status/1567824457296642048">September 8, 2022</a></blockquote>
<hr>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/bitbucket-cml-runnershttps://dvc.org/blog/bitbucket-cml-runnersTue, 06 Sep 2022 00:00:00 GMT<p>A while ago, we learned about
<a href="https://dvc.org/blog/CML-runners-saving-models-1" target="_blank" rel="nofollow noopener noreferrer">training models in the cloud and saving them in Git</a>.
We did so using <a href="https://cml.dev/doc/start/github" target="_blank" rel="nofollow noopener noreferrer">CML and GitHub Actions</a>.
GitLab is <a href="https://cml.dev/doc/start/gitlab" target="_blank" rel="nofollow noopener noreferrer">also supported</a>, and a
<a href="https://github.com/iterative/cml/releases/tag/v0.16.0" target="_blank" rel="nofollow noopener noreferrer">recent CML release</a>
incorporated support for self-hosted runners in Bitbucket Pipelines: a good
excuse to revisit this topic and show how CML works in conjunction with
Bitbucket's CI/CD.</p>
<p>Using CML to provision cloud instances for our model (re)training has a number
of benefits:</p>
<ul>
<li>Bring Your Own Cloud: a single CML command connects your existing cloud to
your existing CI/CD</li>
<li>Cloud abstraction: CML handles the interaction with our cloud provider,
removing the need to configure resources directly. We could even switch cloud
providers by changing a single parameter</li>
<li>Auto-termination: CML automatically terminates instances once they are no
longer being used, reducing idle time (and costs)</li>
</ul>
<h1 id="what-well-be-doing" style="position:relative;">What we'll be doing<a href="#what-well-be-doing" aria-label="what well be doing permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>This guide will explore how we can use CML to (re)train models from one of our
Bitbucket pipelines. We will:</p>
<ol>
<li>Provision an EC2 instance on Amazon Web Services (AWS) from a Bitbucket
pipeline</li>
<li>Train a machine learning model on the provisioned instance</li>
<li>Open a pull request that adds the resulting model to our Bitbucket repository</li>
</ol>
<p>While we could use Bitbucket's own runners for our model training, they have
<a href="https://support.atlassian.com/bitbucket-cloud/docs/limitations-of-bitbucket-pipelines/#LimitationsofBitbucketPipelines-Buildlimits" target="_blank" rel="nofollow noopener noreferrer">limited</a>
memory, storage, and processing power. Self-hosted runners let us work around
these limitations: we can get a runner with specifications tailored to our
computing needs. CML greatly simplifies the setup and orchestration of these
runners.</p>
<p>Moreover, if our data is hosted by our cloud provider, using a runner on the
same cloud would be a logical approach to minimize data transfer costs and time.</p>
<admon type="tip">
<p>While we'll be using
<a href="https://cml.dev/doc/self-hosted-runners?tab=AWS#cloud-compute-resource-credentials" target="_blank" rel="nofollow noopener noreferrer">AWS</a>
in this guide, CML works just as well with
<a href="https://cml.dev/doc/self-hosted-runners?tab=GCP#cloud-compute-resource-credentials" target="_blank" rel="nofollow noopener noreferrer">Google Cloud Platform</a>,
<a href="https://cml.dev/doc/self-hosted-runners?tab=Azure#cloud-compute-resource-credentials" target="_blank" rel="nofollow noopener noreferrer">Microsoft Azure</a>,
and
<a href="https://cml.dev/doc/self-hosted-runners#on-premise-local-runners" target="_blank" rel="nofollow noopener noreferrer">on-premise</a>
machines. Of course, CML would need the appropriate credentials, but otherwise,
it takes care of the differing configuration for us.</p>
</admon>
<h1 id="before-we-start" style="position:relative;">Before we start<a href="#before-we-start" aria-label="before we start permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>You can clone the repository for this guide
<a href="https://bitbucket.org/iterative-ai/example_model_export_cml" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<p>To help follow along, you may want to keep the
<a href="https://cml.dev/doc/start/bitbucket" target="_blank" rel="nofollow noopener noreferrer">Getting started section of the CML docs</a>
open in another tab. The docs cover the following prerequisite steps you'll need
to take if you want to follow along with this blog post:</p>
<ol>
<li><a href="https://cml.dev/doc/self-hosted-runners?tab=Bitbucket#personal-access-token" target="_blank" rel="nofollow noopener noreferrer">Generate a <code>REPO_TOKEN</code> and set it as a repository variable</a>.</li>
<li><a href="https://cml.dev/doc/ref/send-comment#bitbucket" target="_blank" rel="nofollow noopener noreferrer">Install the <em>Pull Request Commit Links app</em> in your Bitbucket workspace</a></li>
</ol>
<p>Additionally, you will need to take the following steps to allow Bitbucket to
provision AWS EC2 instances on your behalf:</p>
<ol>
<li><a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-creds" target="_blank" rel="nofollow noopener noreferrer">Create an <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code> on AWS</a></li>
<li><a href="https://support.atlassian.com/bitbucket-cloud/docs/variables-and-secrets/" target="_blank" rel="nofollow noopener noreferrer">Add the <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code> as repository variables</a></li>
</ol>
<admon type="warn">
<p>In this example, we will be provisioning an <code>m5.2xlarge</code>
<a href="https://aws.amazon.com/ec2/instance-types/" target="_blank" rel="nofollow noopener noreferrer">AWS EC2 instance</a>. Note that this
instance is not included in the free tier, and Amazon
<a href="https://aws.amazon.com/ec2/pricing/on-demand/" target="_blank" rel="nofollow noopener noreferrer">will charge you for your usage</a>
($0.45 per hour at the time of writing). To minimize cost, CML always terminates
the instance upon completion of the pipeline.</p>
</admon>
<h1 id="implementing-the-cml-bitbucket-pipeline" style="position:relative;">Implementing the CML Bitbucket pipeline<a href="#implementing-the-cml-bitbucket-pipeline" aria-label="implementing the cml bitbucket pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>The main point of interest in the project repository is the
<code>bitbucket-pipelines.yml</code> file. Bitbucket will automatically recognize this file
as the one containing our pipeline configuration. In our case, we have defined
one pipeline (named <code>default</code>) that consists of two steps:</p>
<h2 id="launch-self-hosted-runner" style="position:relative;">Launch self-hosted runner<a href="#launch-self-hosted-runner" aria-label="launch self hosted runner permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In the first step, we specify the runner we want to provision. We use a CML
docker image and configure a runner on a medium (<code>m</code>) instance. CML
<a href="https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#machine-type" target="_blank" rel="nofollow noopener noreferrer">automatically translates this generic type to a cloud-specific one</a>.
In the case of AWS, this corresponds with an <code>m5.2xlarge</code> instance.</p>
<p>We also specify the <code>--idle-timeout=30min</code> and <code>--reuse-idle</code> options. The first
of these specifies how long the provisioned instance should wait for jobs before
it is terminated. This ensures that we are not racking up costs due to our
instances running endlessly. With the latter, we ensure that a new instance is
only provisioned when a runner is not already available with the same label.
Combining these two options means that we can automatically scale up the number
of runners (if there are multiple pull requests in parallel) and scale down when
they are no longer required.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token punctuation">-</span> <span class="token key atrule">step</span><span class="token punctuation">:</span>
    <span class="token key atrule">image</span><span class="token punctuation">:</span> iterativeai/cml<span class="token punctuation">:</span>0<span class="token punctuation">-</span>dvc2<span class="token punctuation">-</span>base1
    <span class="token key atrule">script</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token punctuation">|</span><span class="token scalar string">
        cml runner \
            --cloud=aws \
            --cloud-region=us-west \
            --cloud-type=m \
            --idle-timeout=30min \
            --reuse-idle \
            --labels=cml.runner</span></code></pre></div>
<admon type="tip">
<p>CML <a href="https://cml.dev/doc/ref/runner" target="_blank" rel="nofollow noopener noreferrer">has many more options</a> that might pique
your interest. For example, you could use <code>--single</code> to terminate instances
right after completing one job. Or you could set a maximum bidding price for
spot instances with <code>--cloud-spot-price=...</code>. With these features, CML helps you
tailor instances precisely to your needs.</p>
</admon>
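<p>To make the interplay of <code>--reuse-idle</code> and <code>--idle-timeout</code> more concrete, here is a toy model in Python. It is only an illustration of the behavior described above, not CML's actual implementation: the <code>Runner</code> class and both function names are hypothetical.</p>

```python
from dataclasses import dataclass


@dataclass
class Runner:
    """Hypothetical stand-in for a provisioned runner (not a CML type)."""
    labels: frozenset      # e.g. frozenset({"cml.runner"})
    idle_seconds: float    # time since the runner last finished a job
    busy: bool = False


def needs_new_runner(runners, labels):
    """Mimic --reuse-idle: provision only if no idle runner has these labels."""
    wanted = frozenset(labels)
    return not any(r.labels == wanted and not r.busy for r in runners)


def should_terminate(runner, idle_timeout_seconds):
    """Mimic --idle-timeout: stop a runner that has waited too long for jobs."""
    return not runner.busy and runner.idle_seconds >= idle_timeout_seconds
```

<p>With several pull requests in parallel, every busy runner causes <code>needs_new_runner</code> to return <code>True</code> (scale up), and once the jobs finish, each runner that stays idle past the timeout is terminated (scale down).</p>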
<h2 id="train-model-on-self-hosted-runner" style="position:relative;">Train model on self-hosted runner<a href="#train-model-on-self-hosted-runner" aria-label="train model on self hosted runner permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The second step in our pipeline defines the model training task. We specify that
this step should run on the self-hosted runner with the labels <code>[self.hosted, cml.runner]</code> that we provisioned above.
From here, the script lists the individual commands, just as we would run them
in a local terminal.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token punctuation">-</span> <span class="token key atrule">step</span><span class="token punctuation">:</span>
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self.hosted<span class="token punctuation">,</span> cml.runner<span class="token punctuation">]</span>
    <span class="token key atrule">image</span><span class="token punctuation">:</span> iterativeai/cml<span class="token punctuation">:</span>0<span class="token punctuation">-</span>dvc2<span class="token punctuation">-</span>base1
    <span class="token comment"># GPU not yet supported, see https://github.com/iterative/cml/issues/1015</span>
    <span class="token key atrule">script</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> pip install <span class="token punctuation">-</span>r requirements.txt
      <span class="token punctuation">-</span> python get_data.py
      <span class="token punctuation">-</span> python train.py
      <span class="token comment"># Create pull request</span>
      <span class="token punctuation">-</span> cml pr model/random_forest.joblib
      <span class="token comment"># Create CML report</span>
      <span class="token punctuation">-</span> cat model/metrics.txt <span class="token punctuation">></span> report.md
      <span class="token punctuation">-</span> echo '' <span class="token punctuation">></span><span class="token punctuation">></span> report.md
      <span class="token punctuation">-</span> echo '<span class="token tag">!</span><span class="token punctuation">[</span>Confusion Matrix<span class="token punctuation">]</span>(model/confusion_matrix.png)' <span class="token punctuation">></span><span class="token punctuation">></span> report.md
      <span class="token punctuation">-</span> cml send<span class="token punctuation">-</span>comment <span class="token punctuation">-</span><span class="token punctuation">-</span>pr <span class="token punctuation">-</span><span class="token punctuation">-</span>update <span class="token punctuation">-</span><span class="token punctuation">-</span>publish report.md</code></pre></div>
<p>First, we install our requirements, and then we run our data loading and model
training scripts. At this point, our runner contains our newly trained model.
However, we need to take a few extra steps to do something with that model.
Otherwise, our results would be lost when CML terminates the instance.</p>
<p>To add our model to our repository, we create a pull request with <code>cml pr</code>. We
also create a CML report that displays the model performance in the pull
request. We add the metrics and the confusion matrix created in <code>train.py</code> to
the report, and <code>cml send-comment</code> updates the description of the pull request
to the contents of <code>report.md</code> (i.e., our <code>metrics.txt</code> and confusion matrix).</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 482.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/3db3ba81ce9aa53bd01025d6ef50cd79/39600/pr-screenshot.png" alt="The model training report in the pull
request" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>The
resulting pull request showing the model training report</em></p>
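<p>The report-building commands in the pipeline are plain shell redirection. If you would rather assemble the report from within a training script, a minimal Python sketch could look like the following. The function name <code>build_report</code> is our own invention for illustration; it only produces the Markdown text and leaves posting it to <code>cml send-comment</code>.</p>

```python
def build_report(metrics_text, image_path, alt_text="Confusion Matrix"):
    """Return Markdown with the metrics, a blank line, then an image link.

    Roughly mirrors the three shell commands in the pipeline:
    cat metrics > report.md, echo '' >> report.md, echo '![...](...)' >> report.md
    """
    return f"{metrics_text.rstrip()}\n\n![{alt_text}]({image_path})\n"


if __name__ == "__main__":
    # Example with made-up metrics; a real script would read model/metrics.txt.
    print(build_report("accuracy: 0.9\n", "model/confusion_matrix.png"))
```

<p>Writing the returned string to <code>report.md</code> would leave the rest of the pipeline step unchanged.</p>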
<p>That's all there is to it! Once CML has created the pull request, we can merge
it on Bitbucket. CML will automatically terminate the cloud instance after its
specified idle time, thus saving us from high AWS expenses.</p>
<admon type="tip">
<p>You might be interested in storing the resulting model in a DVC remote, rather
than in your Git repository.
<a href="https://iterative.ai/blog/CML-runners-saving-models-2" target="_blank" rel="nofollow noopener noreferrer">Follow this guide to learn how to do so</a>.</p>
</admon>
<h1 id="conclusions" style="position:relative;">Conclusions<a href="#conclusions" aria-label="conclusions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>CML allows us to incorporate our model training into our Bitbucket CI/CD. We can
define a pipeline to provision a cloud instance that meets our requirements and
then use the instance to train our model. The resulting model can be pushed to
our Git repository, along with a detailed report on our model's performance.</p>
<p>Because CML handles the interaction with our cloud provider of choice, we can
switch between different providers (AWS, Azure, or Google Cloud Platform) by
changing a single line. Moreover, CML automatically reduces our cloud expenses
by terminating instances we are no longer using.</p>
<p>Now that we have gotten started with CML in Bitbucket Pipelines, we can look toward some
of CML's more advanced features. It might be worth exploring CML's spot
recovery, for example, which can pick up training from the last epoch when a
spot instance is terminated unexpectedly. Or we might be interested in training models on
GPUs, which CML is also well-suited for.</p>
<p>These topics warrant their own guides, however. Keep an eye out for these
follow-ups on our blog, and make sure to let us know what you would like us to
cover next! You can let us know in the comments or by
<a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">joining our Discord server</a>.</p>https://dvc.org/blog/august-22-community-gemshttps://dvc.org/blog/august-22-community-gemsTue, 30 Aug 2022 00:00:00 GMT<p>Hi there! This is Gema! Today I'll be the guide to Community Gems for August.
Big shout-out to <a href="https://twitter.com/flippedcoding" target="_blank" rel="nofollow noopener noreferrer">Milecia Mcgregor</a>, who
co-authored this post.</p>
<h2 id="if-i-am-tracking-a-directory-with-dvc-how-can-i-read-the-file-names-without-using-dvc-checkout" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/1001787488173572147" target="_blank" rel="nofollow noopener noreferrer">If I am tracking a directory with DVC, how can I read the file names without using <code>dvc checkout</code>?</a><a href="#if-i-am-tracking-a-directory-with-dvc-how-can-i-read-the-file-names-without-using-dvc-checkout" aria-label="if i am tracking a directory with dvc how can i read the file names without using dvc checkout permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This is a wonderful question from @Mikita Karotchykau!</p>
<p>You can read those file names with our DVC Python API. Here's an example of how
that may work:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> os
<span class="token keyword">from</span> dvc<span class="token punctuation">.</span>repo <span class="token keyword">import</span> Repo
<span class="token keyword">for</span> item <span class="token keyword">in</span> Repo<span class="token punctuation">.</span>ls<span class="token punctuation">(</span>
    <span class="token string">"<repo_path_or_url>"</span><span class="token punctuation">,</span>
    <span class="token string">"/path/to/dir"</span><span class="token punctuation">,</span>
    dvc_only<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">,</span>
    rev<span class="token operator">=</span><span class="token string">"<rev>"</span><span class="token punctuation">,</span>
    recursive<span class="token operator">=</span><span class="token boolean">True</span>
<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span>os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token string">"/path/to/dir"</span><span class="token punctuation">,</span> item<span class="token punctuation">[</span><span class="token string">"path"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre></div>
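<p>If you only want file names and not directories, you can post-process the listing. The sketch below is a hedged example: it assumes each entry is a dictionary with a <code>"path"</code> key (as used in the loop above) and an <code>"isdir"</code> flag, and the helper name <code>tracked_file_paths</code> is hypothetical. It works on any list of such dictionaries, so it can be tried without a DVC repository.</p>

```python
import posixpath


def tracked_file_paths(entries, base_dir):
    """Join base_dir with the path of every non-directory entry, sorted."""
    # Entries without an "isdir" key are treated as files (an assumption).
    files = (e["path"] for e in entries if not e.get("isdir", False))
    return sorted(posixpath.join(base_dir, p) for p in files)


if __name__ == "__main__":
    # Made-up sample entries standing in for the listing result.
    sample = [
        {"path": "images", "isdir": True},
        {"path": "images/cat.png", "isdir": False},
        {"path": "labels.csv", "isdir": False},
    ]
    for path in tracked_file_paths(sample, "/path/to/dir"):
        print(path)
```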
<h2 id="how-can-i-mock-the-execution-of-certain-stages-in-dvc-repro" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/1004408394888777738" target="_blank" rel="nofollow noopener noreferrer">How can I mock the execution of certain stages in <code>dvc repro</code>?</a><a href="#how-can-i-mock-the-execution-of-certain-stages-in-dvc-repro" aria-label="how can i mock the execution of certain stages in dvc repro permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>An interesting scenario, posted as a question by @JesusCerquides!</p>
<p>This situation might arise when you have stages that take a long time to run, or
when you are confident about them and want to move on with the pipeline design;
in either case, you don't want to reproduce everything again. One example might be when
your feature engineering is good enough and you want to iterate over
hyperparameters in training.</p>
<p>You should be able to run <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> in this case as it provides a way to
complete <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> when it has been used with the <code>--no-commit</code> or <code>--no-exec</code>
options. Those options cause the command to skip certain stages so you can move
to another stage without executing all of them.</p>
<h2 id="how-can-i-change-the-dataset-for-a-dvc-pipeline-that-runs-completely-with-dvc-repro" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/1004432985052942396" target="_blank" rel="nofollow noopener noreferrer">How can I change the dataset for a DVC pipeline that runs completely with <code>dvc repro</code>?</a><a href="#how-can-i-change-the-dataset-for-a-dvc-pipeline-that-runs-completely-with-dvc-repro" aria-label="how can i change the dataset for a dvc pipeline that runs completely with dvc repro permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Great question from @5216!</p>
<p>One of the straightforward solutions for this challenge is to replace the
dataset in place and run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> again. If the dataset is at some other
path, you can update <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> to use that new path instead of the original
dataset path. If you don't want to lose the previous pipeline and want to keep
it and results for future reproducibility or other needs, you can use
<a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> as it keeps a record in Git of all changes and allows you to
create a branch if needed.</p>
<h2 id="when-i-trigger-a-github-event-i-use-pull_request-types-labeled-and-it-seems-to-cause-the-runner-to-use-the-wrong-sha-how-can-i-fix-this" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/1001003933159915550" target="_blank" rel="nofollow noopener noreferrer">When I trigger a GitHub event, I use <code>pull_request: types: [labeled]</code> and it seems to cause the runner to use the wrong SHA. How can I fix this?</a><a href="#when-i-trigger-a-github-event-i-use-pull_request-types-labeled-and-it-seems-to-cause-the-runner-to-use-the-wrong-sha-how-can-i-fix-this" aria-label="when i trigger a github event i use pull_request types labeled and it seems to cause the runner to use the wrong sha how can i fix this permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Thanks for the good question @hyojoo!</p>
<p>You may have found that this issue prevents you from sending comments to
the PR. A
<a href="https://github.com/iterative/cml/issues/880#issuecomment-1145522505" target="_blank" rel="nofollow noopener noreferrer">change</a>
with respect to the SHAs made us point to the head reference.</p>
<p>We've updated <a href="https://cml.dev/doc/start/github" target="_blank" rel="nofollow noopener noreferrer">CML Start</a> to include a fix:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v3
  <span class="token key atrule">with</span><span class="token punctuation">:</span>
    <span class="token key atrule">ref</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> github.event.pull_request.head.sha <span class="token punctuation">}</span><span class="token punctuation">}</span></code></pre></div>
<h2 id="how-does-dvc-solve-the-file-versioning-problem-specifically-when-we-want-to-roll-back-to-previous-versions-of-the-dataset" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/1005130028692017184" target="_blank" rel="nofollow noopener noreferrer">How does DVC solve the file versioning problem, specifically when we want to roll back to previous versions of the dataset?</a><a href="#how-does-dvc-solve-the-file-versioning-problem-specifically-when-we-want-to-roll-back-to-previous-versions-of-the-dataset" aria-label="how does dvc solve the file versioning problem specifically when we want to roll back to previous versions of the dataset permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Time travel with DVC! We find this topic fascinating. Thanks for bringing
this up, @MiaM!</p>
<p>The <code>git checkout</code> command lets us restore any commit in the repository history. It
automatically adjusts the repository files by replacing, adding, or deleting
them. This Git command changes <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a> and other DVC files, meaning that
Git tracks the DVC metafiles but not the data files themselves. To actually get
back to a previous version of the dataset, make sure to also run <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a>
afterwards.</p>
<p>To make this reproducible, let's look at what happens to the <code>data.xml.dvc</code> file and
the cache folder when we go back to a previous dataset version. To do so, we will
add a dataset, change it and add it to DVC again, and then go back to the first
dataset version.</p>
<p>First, we add a dataset and track it with DVC. If we explore
the <code>data.xml.dvc</code> file and the cache folder, we will see the MD5 hash for the
file, a unique identifier!</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> data.xml.dvc <span class="token comment"># will show file info including MD5 hash</span>
</span>outs:
- md5: a8d60da582524dac805fc7b64d762e58
  size: 33471
  path: data.xml
<span class="token line"><span class="token input">$ </span><span class="token command">cd</span> .dvc/cache
</span><span class="token line"><span class="token input">$ </span><span class="token command">tree</span> <span class="token comment"># will show dataset in the cache with hash reference</span>
</span>.
|___ a8
    |___ d60da582524dac805fc7b64d762e58
</code></pre></div>
<p>After changing the dataset, we add it to DVC again. As you can see in the
<code>data.xml.dvc</code> file, the MD5 hash has changed, because the dataset is different! The
cache, however, keeps both versions. Smart!</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> data.xml.dvc <span class="token comment"># will show new file info including MD5 hash</span>
</span>outs:
- md5: 8e4ed00d7118e31340db6c0ba572658e
  size: 35263
  path: data.xml
<span class="token line"><span class="token input">$ </span><span class="token command">cd</span> .dvc/cache
</span><span class="token line"><span class="token input">$ </span><span class="token command">tree</span> <span class="token comment"># will show both datasets in the cache with their hash reference</span>
</span>.
|___ 8e
|   |___ 4ed00d7118e31340db6c0ba572658e
|___ a8
    |___ d60da582524dac805fc7b64d762e58</code></pre></div>
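<p>The layout in the listing above follows a simple rule: the first two characters of the MD5 hash become a directory name and the rest becomes the file name. Here is a small Python sketch of that rule. The helper names are our own, and the layout is assumed to be the one shown in this post; other DVC versions may organize the cache differently.</p>

```python
import hashlib
import posixpath


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of some bytes."""
    return hashlib.md5(data).hexdigest()


def cache_path(md5: str, cache_dir: str = ".dvc/cache") -> str:
    """Map a hash to <cache_dir>/<first two chars>/<remaining chars>."""
    return posixpath.join(cache_dir, md5[:2], md5[2:])


if __name__ == "__main__":
    # The two hashes from the data.xml.dvc files shown in this post.
    for digest in ("a8d60da582524dac805fc7b64d762e58",
                   "8e4ed00d7118e31340db6c0ba572658e"):
        print(cache_path(digest))
```

<p>Because the path depends only on the content hash, two different versions of <code>data.xml</code> can never overwrite each other in the cache.</p>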
<p>Now let's get back to the previous version of the dataset:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> HEAD~1 data/data.xml.dvc
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc checkout</span>
</span><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span> data/data.xml.dvc</span></code></pre></div>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> data.xml.dvc
</span>outs:
- md5: a8d60da582524dac805fc7b64d762e58
  size: 33471
  path: data.xml</code></pre></div>
<p>Interesting! The hash now refers to the previous version of our dataset,
which is still stored in our cache folder. Because the cache keeps the data, DVC
lets you get back to previous files with the paired <code>git checkout</code> and
<a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a> commands. Please note that you have to check out with Git, but
also with DVC! If you always want to run <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a> after <code>git checkout</code>,
you can use the <code>post-checkout</code>
<a href="https://dvc.org/doc/command-reference/install#installed-git-hooks" target="_blank" rel="nofollow noopener noreferrer">Git hook</a> to
automatically update the workspace with the correct data file versions.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 455px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/56e4e2a30bbced5dae70c20873eee9e8/39600/backtothefuture.png" alt="back to the future" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h2 id="how-can-i-plot-the-result-metrics-for-the-machine-learning-experiments-inside-vscode-dvc-extension-scenario" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/842220310585147452/991695952480043038" target="_blank" rel="nofollow noopener noreferrer">How can I plot the result metrics for the machine learning experiments inside VSCode DVC extension scenario?</a><a href="#how-can-i-plot-the-result-metrics-for-the-machine-learning-experiments-inside-vscode-dvc-extension-scenario" aria-label="how can i plot the result metrics for the machine learning experiments inside vscode dvc extension scenario permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Happy to discover that you are using the
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC extension</a>
for VS Code, @Julian_!</p>
<p>You can define your plots with
<a href="https://dvc.org/doc/dvclive/dvclive-with-dvc" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a>, depending on your
machine learning challenge, and save them as a CSV or JSON file or in another
<a href="https://dvc.org/doc/user-guide/visualizing-plots#supported-file-formats" target="_blank" rel="nofollow noopener noreferrer">supported format</a>.
Then list the file as a plots output in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> by adding <code>plots</code> to the build
stage:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">build</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> features.csv
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> model.pt
    <span class="token key atrule">metrics</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">metrics.json</span><span class="token punctuation">:</span>
          <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span>
    <span class="token key atrule">plots</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">metrics.csv</span><span class="token punctuation">:</span> <span class="token comment"># specify the name and .csv extension file</span>
          <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span></code></pre></div>
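<p>For illustration, here is a minimal sketch of how a training script could
produce the <code>metrics.csv</code> file listed under <code>plots</code>. It uses only
Python's standard <code>csv</code> module as a hand-written stand-in for what DVCLive
would log; the function name <code>write_plot_csv</code> and the <code>epoch</code> and
<code>loss</code> columns are assumptions made for this example, not part of DVC or
DVCLive.</p>

```python
import csv
from pathlib import Path


def write_plot_csv(path, rows, fieldnames=("epoch", "loss")):
    """Write one row per training step to a CSV file with a header line.

    `rows` is a list of dicts keyed by `fieldnames` (hypothetical columns).
    """
    path = Path(path)
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(fieldnames))
        writer.writeheader()
        writer.writerows(rows)
    return path
```

<p>Calling <code>write_plot_csv("metrics.csv", history)</code> at the end of
<code>train.py</code>, with <code>history</code> holding one dict per epoch, would leave a
file in the workspace that the <code>plots</code> entry can point to.</p>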
<h2 id="im-constructing-a-pipeline-with-several-stages-inside-the-dvcyaml-file" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/1011617355849269258" target="_blank" rel="nofollow noopener noreferrer">I'm constructing a pipeline with several stages inside the <code>dvc.yaml</code> file. When I execute <code>dvc exp run</code> or <code>dvc repro</code>, stages run randomly. What is the reason behind this, or did I miss something?</a><a href="#im-constructing-a-pipeline-with-several-stages-inside-the-dvcyaml-file" aria-label="im constructing a pipeline with several stages inside the dvcyaml file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Hello there, @ekmekci48! That is indeed a really great question.</p>
<p>To ensure a linear order in your pipeline, you should chain all your
pipeline stages so that the output of each stage is a dependency of the next,
from the beginning to the end of your pipeline. Please make sure that you
specify dependencies and outputs for each stage: that is what determines the
execution order. Stages that don't depend on each other may still be executed
in any order.</p>
<p>As an example, imagine that we have 3 stages: load, feature engineering, and
training. The load output will be the feature engineering dependency, and the
feature engineering output will be the training dependency.</p>
<p>The key concept to keep in mind here is that the output of one stage should
be the dependency of the next across all pipeline stages.</p>
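<p>The chaining rule above can be sketched in a few lines of Python. This is
only an illustration of how an order can be derived from dependencies and
outputs, not DVC's actual implementation; the stage names and file names are
hypothetical, and the sketch assumes Python 3.9+ for the standard
<code>graphlib</code> module.</p>

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each lists the files it depends on and the files it produces.
stages = {
    "train": {"deps": ["features.csv"], "outs": ["model.pt"]},
    "load": {"deps": ["raw.csv"], "outs": ["data.csv"]},
    "featurize": {"deps": ["data.csv"], "outs": ["features.csv"]},
}


def execution_order(stages):
    """Return stage names ordered so that producers come before consumers."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage must run after every stage that produces one of its dependencies.
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(stages))
```

<p>Because each stage consumes the previous stage's output, only one order is
possible here: load, then featurize, then train. If two stages shared no files,
the sorter would be free to emit them in either order, which is the "random"
behavior described in the question.</p>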
<p>As an illustration, I've added a diagram from our learning
<a href="https://learn.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">course</a> below: check out the <code>-o</code> and <code>-d</code> flags.
Those will be key for chaining your stages.</p>
<p>Let's also thank @daavoo for helping out by pointing to the docs on this one!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c84b6595555ff4070d3c1ac55e69caf5/39600/pipelines.png" alt="notes from pipelines lesson iterative learning course" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Please check out the <a href="https://dvc.org/doc/command-reference/dag" target="_blank" rel="nofollow noopener noreferrer">docs</a> to learn
more!</p>
<hr>
<p><img src="https://media.giphy.com/media/l0IycQmt79g9XzOWQ/giphy.gif" alt="Shut It Down GIF by Matt Cutshall"></p>
<p>Keep an eye out for our next Office Hours Meetup! Make sure you stay up to date
with us to find out what it is!
<a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/" target="_blank" rel="nofollow noopener noreferrer">Join our group</a>
to stay up to date with specifics as we get closer to the event!</p>
<p>Check out <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">our docs</a> to get all your DVC, CML, and MLEM
questions answered!</p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to chat with the
community!</p>
<p>https://dvc.org/blog/august-22-heartbeat, Tue, 16 Aug 2022 00:00:00 GMT</p>
<p>Welcome to the August Heartbeat! As we all soak in the remaining summer days,
swing along in your hammock and take in all the great news from the Iterative
Community!</p>
<p><img src="https://media.giphy.com/media/2uI9paIuAWgaqfyX0Q/giphy.gif" alt="Ukulele Hammock GIF by Northern Illinois University"></p>
<h1 id="from-greater-aiml-community" style="position:relative;">From Greater AI/ML Community<a href="#from-greater-aiml-community" aria-label="from greater aiml community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="vanishing-gradients-podcast" style="position:relative;">Vanishing Gradients Podcast<a href="#vanishing-gradients-podcast" aria-label="vanishing gradients podcast permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1b59357e3392d514f5968b677fc40465/e2d37/vanishing-gradients.png" alt="Vanishing Gradients" title="Vanishing Gradients" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
If you are not familiar with
<a href="https://twitter.com/hugobowne" target="_blank" rel="nofollow noopener noreferrer"><strong>Hugo Bowne-Anderson</strong></a>, you should be. He was
the host of my all-time favorite Data Science podcast
<a href="https://www.datacamp.com/podcast" target="_blank" rel="nofollow noopener noreferrer">DataFramed</a> while he was at
<a href="https://www.datacamp.com/" target="_blank" rel="nofollow noopener noreferrer">DataCamp</a>. DataFramed helped me immeasurably when I
started my data science journey. It provided not only great teachings on
many data science concepts, but even more importantly, the ability to gain
perspectives from different people across all parts of the data space, talking
about challenges, danger zones, and issues that we all need to be aware of in
the field. Recently Hugo started a new podcast,
<a href="https://vanishinggradients.fireside.fm/" target="_blank" rel="nofollow noopener noreferrer">Vanishing Gradients</a>. This newer
endeavor is in a somewhat different format than DataFramed, but still with
Hugo's characteristic deep dive into all the challenges that come up when
working with data. Hugo uses a long-format conversation approach with many
leaders and great thinkers in the data science/machine learning/AI space. In
episodes <a href="https://vanishinggradients.fireside.fm/7" target="_blank" rel="nofollow noopener noreferrer">seven</a> and
<a href="https://vanishinggradients.fireside.fm/8" target="_blank" rel="nofollow noopener noreferrer">eight</a>, Hugo has a fascinating chat
with <a href="https://twitter.com/pwang" target="_blank" rel="nofollow noopener noreferrer"><strong>Peter Wang</strong></a>, CEO of Anaconda, in which they
talk about a number of topics including how Python became so big in Data
Science, the emergence of open source collaborative environments, and things
that the PyData stack solves. Then it gets really interesting as they dive into
the open source model in the context of finite and infinite games and open
source software as a "paradigm of humanity's ability to create generative,
nourishing and anti-rivalrous systems." 🤯 Super interesting discussion and food
for thought. I've already listened to both episodes twice. I highly recommend
them and this new podcast in general.</p>
<h1 id="from-the-iterative-tools-community" style="position:relative;">From the Iterative tools Community<a href="#from-the-iterative-tools-community" aria-label="from the iterative tools community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="mikołaj-kania---can-dvc-be-used-for-kaggle" style="position:relative;"><strong>Mikołaj Kania</strong> - Can DVC Be Used for Kaggle?<a href="#miko%C5%82aj-kania---can-dvc-be-used-for-kaggle" aria-label="mikołaj kania can dvc be used for kaggle permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://twitter.com/MikolajKania" target="_blank" rel="nofollow noopener noreferrer"><strong>Mikołaj Kania</strong></a> suggests that you upgrade
your Kaggle competition workflow from the “spaghetti code” of Jupyter Notebooks
and use the more mature way of creating reproducible ML results by using DVC
<a href="https://mikolajkania.com/2022/08/07/dvc-kaggle-mlops/" target="_blank" rel="nofollow noopener noreferrer">here on his blog</a>.</p>
<p>He notes that notebooks make it hard to compare changes between runs.
Instead, he suggests a workflow in which you create a branch for every major
experiment type, experiment in each, and persist the best and most notable
outcomes (good and bad). The best results are then submitted to Kaggle.
You can find more about his workflow in
<a href="https://github.com/mikolajkania/kaggle-03-house-prices" target="_blank" rel="nofollow noopener noreferrer">his repo for the project.</a></p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/58af1bbd7433b00332e27cde7b9dd4e1/39600/kaggle-dvc.png" alt="Using DVC for Kaggle Competition" title="Using DVC for Kaggle Competition" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>DVC with Kaggle
(<a href="https://mikolajkania.com/2022/08/07/dvc-kaggle-mlops/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<p>Mikołaj explains how DVC's project structure ensures reproducible results and
builds good habits around best practices. One drawback he noted was the lack of
an experimentation UI, but we just introduced the DVC extension for VS Code to
help with that, and there’s always Iterative Studio. Look out for improvements
to the experiment features in both tools in the coming months! Also, experimenting with
DVC in Kaggle may give you some good practice for things we are cooking up
internally! 😉🤫</p>
<h2 id="shambhavi-mishra---searching-for-semantic-similarity" style="position:relative;"><strong>Shambhavi Mishra</strong> - Searching for Semantic Similarity<a href="#shambhavi-mishra---searching-for-semantic-similarity" aria-label="shambhavi mishra searching for semantic similarity permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://twitter.com/ShambhaviCodes" target="_blank" rel="nofollow noopener noreferrer"><strong>Shambhavi Mishra</strong></a> in her post
<a href="https://medium.com/towards-artificial-intelligence/searching-for-semantic-similarity-cfbff2388d04" target="_blank" rel="nofollow noopener noreferrer">Searching for Semantic Similarity</a>
details the steps of her NLP project on similarity algorithms. She mainly
focuses on cosine similarity using a Stack Overflow questions dataset. The
end-to-end project uses Sentence BERT, Fast Text, DVC, DAGsHub, Streamlit and
deploys the web app on an AWS EC2 instance.</p>
<p>Once you follow all the steps, you will have computed the similarity between a
search query and a database of texts and ranked all the texts by their
similarity score to retrieve the most similar text together with its index.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d3f7f7d61528e38fc14c23b13223680c/39600/cosine-similarity.png" alt="Cosine Similarity" title="Cosine Similarity" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Understanding Cosine Similarity
(<a href="https://www.oreilly.com/library/view/mastering-machine-learning/9781785283451/ba8bef27-953e-42a4-8180-cea152af8118.xhtml" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
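<p>The ranking step described above can be sketched as follows. This is a
minimal, dependency-free illustration rather than the code from Shambhavi's
project: it assumes the texts have already been turned into embedding vectors
(for example by a sentence encoder), and the function names
<code>cosine_similarity</code> and <code>rank_by_similarity</code> are chosen for this
example.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, corpus):
    """Return (index, score) pairs sorted from most to least similar to the query."""
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

<p>The first pair returned by <code>rank_by_similarity</code> holds the index of the
most similar text and its score, which is the lookup the post describes.</p>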
<h2 id="evgenii-munin---run-s3-locally-with-minio-for-the-dvc-machine-learning-pipeline" style="position:relative;"><strong>Evgenii Munin</strong> - Run S3 Locally With MinIO for the DVC Machine Learning Pipeline<a href="#evgenii-munin---run-s3-locally-with-minio-for-the-dvc-machine-learning-pipeline" aria-label="evgenii munin run s3 locally with minio for the dvc machine learning pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>If you are in need of object storage to work with data through an API, but need
to do so in a private network,
<a href="https://www.linkedin.com/in/evgenii-munin-01932a143/" target="_blank" rel="nofollow noopener noreferrer"><strong>Evgenii Munin</strong></a> shows
how to set up MinIO as remote storage with DVC to do just that
<a href="https://betterprogramming.pub/run-s3-locally-with-minio-for-dvc-machine-learning-pipeline-7fa3d240d3ab" target="_blank" rel="nofollow noopener noreferrer">in this piece on Medium</a>.
In this cool use case, he starts by installing the MinIO server and building a
Docker image to run it, sharing a great repo on Kafka-to-S3 in which MinIO was
used to mock S3 for the data. Then he shows you how to link the MinIO server as
DVC remote storage.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/3122dc4b5f05f5669546d3d8fe06f7d2/39600/minio.png" alt="Minio Browser" title="Minio Browser" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Minio Browser
with Data pushed from DVC
(<a href="https://betterprogramming.pub/run-s3-locally-with-minio-for-dvc-machine-learning-pipeline-7fa3d240d3ab" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="caleb-kaiser---moving-from-data-science-to-machine-learning-engineering" style="position:relative;"><strong>Caleb Kaiser</strong> - Moving from Data Science to Machine Learning Engineering<a href="#caleb-kaiser---moving-from-data-science-to-machine-learning-engineering" aria-label="caleb kaiser moving from data science to machine learning engineering permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>It can sometimes be confusing to determine where data science stops and machine
learning engineering starts. <a href="https://twitter.com/KaiserFrose" target="_blank" rel="nofollow noopener noreferrer"><strong>Caleb Kaiser</strong></a>
helps clarify this
<a href="https://www.kdnuggets.com/2020/11/moving-data-science-machine-learning-engineering.html" target="_blank" rel="nofollow noopener noreferrer">in this old but good piece</a>
in <a href="https://www.kdnuggets.com" target="_blank" rel="nofollow noopener noreferrer">KD Nuggets</a>. He provides four examples of
real-world projects and defines which portions of each project are data science
and which are ML engineering. Overall, what we find is that machine learning
engineering covers all the tasks needed to get the models that data scientists
create into production applications.</p>
<p>He goes on to dive deeper into one of the examples and shows the promise of
some tools that bridge the gap between machine learning and software
engineering, highlighting DVC and Hugging Face. This is a good piece to read if
you are grappling with the difference!</p>
<p><img src="https://media.giphy.com/media/xUNd9DLukkavmhybAs/giphy.gif" alt="Season 2 Episode 6 GIF by Portlandia"></p>
<h2 id="just-a-few-other-things" style="position:relative;">Just a few other things…<a href="#just-a-few-other-things" aria-label="just a few other things permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ul>
<li>GitHub Goodness alert for
<a href="https://github.com/instill-ai/vdp" target="_blank" rel="nofollow noopener noreferrer">Visual Data Preparation (VDP),</a> an
open-source visual data ETL tool to streamline the end-to-end visual data
processing pipeline. Among the highlights: a fast way to build end-to-end
visual data pipelines, pre-built ETL data connectors, and integration with DVC</li>
<li><a href="https://twitter.com/jillianerowe" target="_blank" rel="nofollow noopener noreferrer"><strong>Jillian Rowe</strong></a> gave a shout-out to DVC
on a
<a href="https://topenddevs.com/podcasts/adventures-in-devops/episodes/the-intersection-of-data-and-devops-devops-124" target="_blank" rel="nofollow noopener noreferrer">recent podcast</a>
from
<a href="https://topenddevs.com/podcasts/adventures-in-devops" target="_blank" rel="nofollow noopener noreferrer">Adventures in DevOps Podcast</a>
in an episode where they discuss the intersection of data and DevOps</li>
<li>If you are interested in contributing to researchers' learning about machine
learning experimentation tools, you can take
<a href="https://www.freelancer.com.au/projects/machine-learning/Seeking-Qualified-Respondents-for-Online-34294453.html" target="_blank" rel="nofollow noopener noreferrer">this survey</a>.
Spread the word!</li>
</ul>
<h2 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="-model-registry-released-in-iterative-studio" style="position:relative;">🎉 Model Registry released in Iterative Studio<a href="#-model-registry-released-in-iterative-studio" aria-label=" model registry released in iterative studio permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>On July 26th we released our new
<a href="https://iterative.ai/model-registry" target="_blank" rel="nofollow noopener noreferrer">model registry in Iterative Studio.</a><br>
The great work done by the MLEM team building a git-based model registry is now
incorporated in Studio in a web UI. This release took the work of half the
people in the company and we are proud of the steps we are taking to meet people
where they are and round out your options whether you are comfortable in the
CLI, API, or web UI. Be sure to try it out and give us your feedback. Learn more
<a href="https://dvc.org/blog/iterative-studio-model-registry" target="_blank" rel="nofollow noopener noreferrer">in the blog post</a> and
<a href="https://dvc.org/doc/studio/user-guide/model-registry/what-is-a-model-registry" target="_blank" rel="nofollow noopener noreferrer">in the docs</a>.
Look out for a full tutorial coming soon!</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/DYeVI-QrHGI?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h3 id="-iteratives-first-internal-hackathon" style="position:relative;">🧑🏽💻 Iterative's First Internal Hackathon<a href="#-iteratives-first-internal-hackathon" aria-label=" iteratives first internal hackathon permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Last week we had our very first internal Hackathon! The entire company
participated in the 48-hour computer vision challenge classifying dogs, cats,
croissants and muffins. Part of the objective was to familiarize ourselves and
test a new tool that we are expecting to release later this year.</p>
<p>Eight teams competed for prizes for the best outcome, but also for the best
integrations with other tools, the best dog, cat, croissant, and muffin photos
from team members, and the best notes from the experience. I think the notes of
our newest DevRel <a href="https://twitter.com/SoyGema" target="_blank" rel="nofollow noopener noreferrer"><strong>Gema Parreño Piqueras</strong></a> are
in the running for the prize. (Learn more about Gema in the New Hires section
below!)</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/798cc42eba6e04b60100fb9f4f5d0d4f/03346/gema-hackathon-notes.jpg" alt="Gema Parreño Piqueras' Hackathon notes" title="Gema Parreño Piqueras' Hackathon notes" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Gema Parreño Piqueras' Hackathon notes
(<a href="https://twitter.com/SoyGema/status/1558135976698028034?s=20&t=lXyAWLISwf8gUl8SZS84AQ" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<p>See the members of the winning teams below. Team members
<a href="https://www.linkedin.com/in/danielkharitonov/" target="_blank" rel="nofollow noopener noreferrer"><strong>Daniel Kharitonov</strong></a> and
<a href="https://www.linkedin.com/in/jon-burdo-59730a83/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jon Burdo</strong></a> organized the
whole event and put together an extremely comprehensive document to help guide
the teams. We are looking forward to more of these events in the future!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7f9d897ef06e99ea1e5f9cddc70b8413/03346/winners.jpg" alt="Winners of the First Iterative Hackathon" title="Winners of the First Iterative Hackathon" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Winners of the first Iterative Internal Hackathon, Source: Dmitry Petrov</em></p>
<h3 id="-dmitry-petrov-in-ai-techpark-and-the-new-stack" style="position:relative;">📰 Dmitry Petrov in AI Techpark and The New Stack<a href="#-dmitry-petrov-in-ai-techpark-and-the-new-stack" aria-label=" dmitry petrov in ai techpark and the new stack permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a> gives a sneak peek into the
recent developments at Iterative.ai, highlights the most exciting trends, and
shares his entrepreneurial journey
<a href="https://ai-techpark.com/aitech-interview-with-dmitry-petrov-co-founder-ceo-at-iterative-ai/" target="_blank" rel="nofollow noopener noreferrer">in this article</a>
in <a href="https://ai-techpark.com/ai/" target="_blank" rel="nofollow noopener noreferrer">AI Techpark.</a></p>
<p>Dmitry also wrote a piece for <a href="https://thenewstack.io/" target="_blank" rel="nofollow noopener noreferrer">The New Stack</a> entitled
<a href="https://thenewstack.io/why-we-built-an-open-source-ml-model-registry-with-git/" target="_blank" rel="nofollow noopener noreferrer">Why We Built an Open Source ML Model Registry with Git</a>.
As the title suggests, it covers the why, along with learnings from our
customers' use cases and the realization of the need for Model Registry as Code
(MRaC), continuing our GitOps approach to building tools for machine learning.</p>
<h2 id="david-de-la-iglesia-castro---making-mlops-uncool-again" style="position:relative;"><strong>David de la Iglesia Castro</strong> - Making MLOps Uncool Again<a href="#david-de-la-iglesia-castro---making-mlops-uncool-again" aria-label="david de la iglesia castro making mlops uncool again permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>If you haven't gotten a chance to make it to the conferences where
<a href="https://twitter.com/daviddelachurch" target="_blank" rel="nofollow noopener noreferrer"><strong>David de la Iglesia Castro</strong></a> presented
his popular talk or workshop entitled
<a href="https://www.youtube.com/watch?v=J6fduKE1j1g" target="_blank" rel="nofollow noopener noreferrer">Making MLOps Uncool Again</a>, you
can now catch it on our very own
<a href="https://www.youtube.com/channel/UC37rp97Go-xIX3aNFVHhXfQ" target="_blank" rel="nofollow noopener noreferrer">YouTube channel</a>! In
this presentation you will learn how to build an MLOps workflow by extending the
power of Git and GitHub with open-source tools DVC and CML. In the end, you will
have an automated workflow that covers the entire lifecycle of an ML model, from
data labeling to monitoring predictions.
<a href="https://github.com/iterative/workshop-uncool-mlops" target="_blank" rel="nofollow noopener noreferrer">Find the repo for the project here</a>,
and the
<a href="https://github.com/iterative/workshop-uncool-mlops-solution" target="_blank" rel="nofollow noopener noreferrer">solution here</a>.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/J6fduKE1j1g?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="new-hires" style="position:relative;">New hires<a href="#new-hires" aria-label="new hires permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://twitter.com/SoyGema" target="_blank" rel="nofollow noopener noreferrer"><strong>Gema Parreño Piqueras</strong></a> joins our team from
Madrid, Spain as a Developer Advocate. You may already be familiar with
Gema if you've been taking our <a href="https://learn.dvc.org" target="_blank" rel="nofollow noopener noreferrer">online course</a> this
summer because of the
<a href="https://twitter.com/SoyGema/status/1558135976698028034?s=20&t=pJAfd-S4aoKGf4UhsnlgCw" target="_blank" rel="nofollow noopener noreferrer">gorgeous notes</a>
she contributed per module. Gema was born and raised as an Architect (of
buildings) but switched to tech a while back. She had her own video game
start-up and has also worked as a Data Scientist in the Financial Industry. She
has contributed to an open source StarCraft II ML project. Gema loves indie games,
puzzles, and croquettes! She makes the 4th teammate from España! 🇪🇸</p>
<p><a href="https://www.linkedin.com/in/marcinjasion/" target="_blank" rel="nofollow noopener noreferrer"><strong>Marcin Jasion</strong></a> joins the team as
a Senior Platform Engineer from Poland. He has been friends with team member
Paweł Redzyński for years. When not working, he likes travelling, eating, and
motorcycling, and is an avid cross-fitter. He also has a cat that likes to be a
part of meetings! 🐈</p>
<p><a href="https://www.linkedin.com/in/domasmonkus/" target="_blank" rel="nofollow noopener noreferrer"><strong>Domas Monkus</strong></a> joins the CML team
as an engineer from Lithuania. Before joining us at Iterative, Domas spent 10
years at Canonical working on juju, livepatch, and many internal projects. He's
a husband and father with a house outside the hustle and bustle of the city, so
he mentioned that lawn mowing is one of his main free time activities. 🏡</p>
<h2 id="upcoming-events" style="position:relative;">Upcoming Events<a href="#upcoming-events" aria-label="upcoming events permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This week is <a href="https://ai4.io/" target="_blank" rel="nofollow noopener noreferrer">AI4</a>!
<a href="https://twitter.com/fullstackml?lang=en" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a> will give a talk as
well as participate in a panel discussion on MLOps. If you are attending, stop
by the booth and say hi or check out one of the in-booth demos we will have on
our tools throughout the day.</p>
<p>Additional conferences we will be attending this year:</p>
<ul>
<li><a href="https://twitter.com/SoyGema" target="_blank" rel="nofollow noopener noreferrer"><strong>Gema Parreño Piqueras</strong></a> and our lead docs
writer, <a href="https://twitter.com/JorgeOrpinel" target="_blank" rel="nofollow noopener noreferrer"><strong>Jorge Orpinel Perez</strong></a>, will be
heading to Mexico City August 31-September 1st for the
<a href="https://www.latam-ai.com/" target="_blank" rel="nofollow noopener noreferrer">LATAM AI Conference</a>. Gema will give a
presentation on experimentation in our new
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC extension for VS Code</a>.</li>
<li><a href="https://www.southerndatascience.com/" target="_blank" rel="nofollow noopener noreferrer">Southern Data Science Conference</a> in
Atlanta, GA on September 8-9th.</li>
<li><a href="https://odsc.com/california/" target="_blank" rel="nofollow noopener noreferrer">ODSC West</a> in San Francisco</li>
<li><a href="https://deeplearningworld.de/" target="_blank" rel="nofollow noopener noreferrer">Deep Learning World</a> - Berlin</li>
<li><a href="https://www.re-work.co/events/mlops-summit-2022" target="_blank" rel="nofollow noopener noreferrer">MLOps Summit - Re-work</a> -
London</li>
<li>Dmitry Petrov will be speaking at
<a href="https://www.githubuniverse.com/" target="_blank" rel="nofollow noopener noreferrer">GitHub Universe</a> on November 9-10!</li>
<li><a href="https://www.torontomachinelearning.com/" target="_blank" rel="nofollow noopener noreferrer">Toronto Machine Learning Summit</a> -
Toronto</li>
</ul>
<p>We will also be reviving our virtual meetups this fall, so be sure to
<a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/" target="_blank" rel="nofollow noopener noreferrer">join our group on Meetup.</a></p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a>
to find details of all the open positions. Please share with anyone looking to
have a lot of fun building the next generation of machine learning to production
tools! 🚀</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b06f20b39d5f8146f4baadac1aa90e0b/03346/hiring.jpg" alt="Iterative.ai is Hiring" title="Iterative.ai is Hiring" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Iterative is Hiring
(<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h3 id="-doc-updates" style="position:relative;">✍🏼 Doc Updates<a href="#-doc-updates" aria-label=" doc updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li>As noted above, there are
<a href="https://dvc.org/doc/studio/user-guide/model-registry/what-is-a-model-registry" target="_blank" rel="nofollow noopener noreferrer">new docs for Iterative Studio's Model Registry</a></li>
<li>In case you missed it, CML now supports
<a href="https://bitbucket.org/product" target="_blank" rel="nofollow noopener noreferrer">Bitbucket</a>! You can find the
<a href="https://cml.dev/doc/start/bitbucket#get-started-with-cml-on-bitbucket" target="_blank" rel="nofollow noopener noreferrer">docs for the Bitbucket integration here</a>.</li>
</ul>
<h3 id="-blog-post" style="position:relative;">✍🏼 Blog post<a href="#-blog-post" aria-label=" blog post permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li>💎 Don't miss
<a href="https://dvc.org/blog/july-22-community-gems" target="_blank" rel="nofollow noopener noreferrer">July's Community Gems</a>, which is full
of great questions from the Community.</li>
<li><a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a> provides a new
tutorial for
<a href="https://dvc.org/blog/serving-models-with-mlem" target="_blank" rel="nofollow noopener noreferrer">Serving Machine Learning Models with MLEM.</a>
Don't miss it!</li>
</ul>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Once again we have a tie for best Tweet! Looking forward to seeing the video on
this one from <a href="https://twitter.com/AvikalpGupta" target="_blank" rel="nofollow noopener noreferrer"><strong>Avikalp Kumar Gupta</strong></a>!🍿 You
can find the slides
<a href="https://drive.google.com/file/d/1-iOgtVDWG13A9MxRDet246Gnbdrkb0vv/view" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr"><a href="https://twitter.com/hashtag/microwin?src=hash&ref_src=twsrc%5Etfw">#microwin</a> of the day:<br><br>Spoke at <a href="https://twitter.com/hashtag/GCCDBLR?src=hash&ref_src=twsrc%5Etfw">#GCCDBLR</a> '22 (annual flagship event by <a href="https://twitter.com/gdgcblr">@gdgcblr</a>) about setting up effective <a href="https://twitter.com/hashtag/DataScience?src=hash&ref_src=twsrc%5Etfw">#DataScience</a> teams. And shared with everyone, how tools like <a href="https://twitter.com/hashtag/git?src=hash&ref_src=twsrc%5Etfw">#git</a> <a href="https://twitter.com/github">@github</a> <a href="https://twitter.com/DVCorg">@DVCorg</a> <a href="https://twitter.com/ProjectJupyter">@ProjectJupyter</a> Jupytext and <a href="https://twitter.com/vibinex">@vibinex</a> can make it easier.<a href="https://twitter.com/hashtag/technology?src=hash&ref_src=twsrc%5Etfw">#technology</a> <a href="https://twitter.com/hashtag/startup?src=hash&ref_src=twsrc%5Etfw">#startup</a> <a href="https://twitter.com/hashtag/day38?src=hash&ref_src=twsrc%5Etfw">#day38</a> <a href="https://t.co/GBLXa9OGAO">pic.twitter.com/GBLXa9OGAO</a></p>— Avikalp Kumar Gupta (@AvikalpGupta) <a href="https://twitter.com/AvikalpGupta/status/1556609442908884994">August 8, 2022</a></blockquote>
<p>It's also great to have our new DVC extension shouted out by
<a href="https://twitter.com/HaroldSinnott" target="_blank" rel="nofollow noopener noreferrer"><strong>Harold Sinnott</strong></a>!</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">10 VScode extensions every data scientist should have💻🤖<br><br>1. Python<br>2. Pylance<br>3. Python Indent<br>4. Jupyter<br>5. Jupyter notebook renderers<br>6. DVC - (ML model experiment tracking)<br>7. Gitlens<br>8. Todo MD<br>9. Excel viewer<br>10. Markdown preview GitHub styling<br><br>via <a href="https://twitter.com/avikumart_">@avikumart_</a> <a href="https://twitter.com/hashtag/AI?src=hash&ref_src=twsrc%5Etfw">#AI</a> <a href="https://twitter.com/hashtag/IoT?src=hash&ref_src=twsrc%5Etfw">#IoT</a></p>— Harold Sinnott 🇺🇸 (@HaroldSinnott) <a href="https://twitter.com/HaroldSinnott/status/1545058509087092736">July 7, 2022</a></blockquote>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/july-22-community-gemshttps://dvc.org/blog/july-22-community-gemsTue, 26 Jul 2022 00:00:00 GMT<h2 id="how-can-i-track-a-new-file-added-to-my-data-folder-if-the-data-folder-is-already-tracked-by-dvc-yet-ignored-by-git" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/983278896587894804" target="_blank" rel="nofollow noopener noreferrer">How can I track a new file added to my <code>data</code> folder if the <code>data</code> folder is already tracked by DVC, yet ignored by Git?</a><a href="#how-can-i-track-a-new-file-added-to-my-data-folder-if-the-data-folder-is-already-tracked-by-dvc-yet-ignored-by-git" aria-label="how can i track a new file added to my data folder if the data folder is already tracked by dvc yet ignored by git permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Great question on how DVC handles data tracking from @NgHoangDat!</p>
<p>Since you already track the <code>data</code> folder, when you add a new file into it, all
you need to do is update your DVC history. You can use either <a href="https://dvc.org/doc/command-reference/add"><code>dvc add data</code></a> or
<a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> to start tracking the new file.</p>
<p>DVC also recalculates hashes only for the files that changed. If you add or modify a small
number of files in that folder, the update will not take very long.</p>
<h2 id="what-would-be-the-best-method-to-get-the-remote-url-of-a-given-dataset-inside-a-python-environment" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/984870485668008007" target="_blank" rel="nofollow noopener noreferrer">What would be the best method to get the remote URL of a given dataset inside a Python environment?</a><a href="#what-would-be-the-best-method-to-get-the-remote-url-of-a-given-dataset-inside-a-python-environment" aria-label="what would be the best method to get the remote url of a given dataset inside a python environment permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Wonderful question from @come_arvis!</p>
<p>You can use the <code>get_url</code> method of the
<a href="https://dvc.org/doc/api-reference" target="_blank" rel="nofollow noopener noreferrer">DVC Python API</a> to do this. Here's an
example of a script you might run to get the remote URL.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> dvc<span class="token punctuation">.</span>api
resource_url <span class="token operator">=</span> dvc<span class="token punctuation">.</span>api<span class="token punctuation">.</span>get_url<span class="token punctuation">(</span>
    <span class="token string">'get-started/data.xml'</span><span class="token punctuation">,</span>
    repo<span class="token operator">=</span><span class="token string">'https://github.com/iterative/dataset-registry'</span>
<span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>resource_url<span class="token punctuation">)</span>
<span class="token comment"># https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355</span></code></pre></div>
<p>This URL is built with the remote URL from the project configuration file,
<code>.dvc/config</code>, and the <code>md5</code> file hashes stored in the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file corresponding
to the data file or directory whose storage location you want.</p>
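<p>To make the composition concrete, here is a minimal sketch that rebuilds the
URL from the example above. The helper name <code>build_cache_url</code> is hypothetical
and not part of the DVC API, and the path layout (first two hash characters as a
directory, the rest as the file name) is assumed from that example URL; other
DVC versions or remote types may lay out storage differently.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">def build_cache_url(remote_url, md5):
    """Join a remote URL and an md5 hash following the layout of the example URL."""
    # Assumption: the first two hex characters form a directory,
    # and the remaining characters form the file name.
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}"


# Values taken from the example output above.
remote = "https://remote.dvc.org/dataset-registry"
md5 = "a304afb96060aad90176268345e10355"
print(build_cache_url(remote, md5))
# https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355</code></pre></div>
<p>In practice you would still call <code>dvc.api.get_url</code>, since it reads the remote
configuration and hashes for you.</p>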
<h2 id="im-excited-about-mlem-helping-expose-api-endpoints-to-our-model-but-heard-it-was-experimental-where-can-i-learn-more-about-how-to-deploy-models-with-this-tool" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/992517466662117386" target="_blank" rel="nofollow noopener noreferrer">I'm excited about MLEM helping expose API endpoints to our model, but heard it was experimental. Where can I learn more about how to deploy models with this tool?</a><a href="#im-excited-about-mlem-helping-expose-api-endpoints-to-our-model-but-heard-it-was-experimental-where-can-i-learn-more-about-how-to-deploy-models-with-this-tool" aria-label="im excited about mlem helping expose api endpoints to our model but heard it was experimental where can i learn more about how to deploy models with this tool permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Great question from @raveman^2!</p>
<p>There are a few ways you can expose API endpoints for your model:</p>
<ul>
<li>Run <code>mlem serve</code> to generate a FastAPI endpoint with your model.</li>
<li>Export the model as a Python package for your own custom-built API.</li>
<li>Use the experimental deployment to Heroku.</li>
</ul>
<p>You can find more details here in the MLEM docs: <a href="https://mlem.ai/doc/get-started" target="_blank" rel="nofollow noopener noreferrer">https://mlem.ai/doc/get-started</a></p>
<p>You can also see an example of deploying a model with MLEM in this
<a href="https://dvc.org/blog/serving-models-with-mlem" target="_blank" rel="nofollow noopener noreferrer">blog post tutorial</a>.</p>
<h2 id="how-do-i-revert-a-dvc-add-command-to-stop-tracking-data" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/993111134896918599" target="_blank" rel="nofollow noopener noreferrer">How do I revert a <code>dvc add</code> command to stop tracking data?</a><a href="#how-do-i-revert-a-dvc-add-command-to-stop-tracking-data" aria-label="how do i revert a dvc add command to stop tracking data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This is a good question from @Nwoke!</p>
<p>If you have accidentally added the wrong directory or files for DVC to track,
you can easily remove them with the <a href="https://dvc.org/doc/command-reference/remove"><code>dvc remove</code></a> command. This is used to remove
the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file and ensure that the original data file is no longer being
tracked. Here's an example of this command being used:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remove</span> data.csv.dvc</span></code></pre></div>
<p>Sometimes when you stop tracking data, you also want to remove it from your
cache. You can do this with the <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a> command, which removes all unused
data from the cache, not just the target of <a href="https://dvc.org/doc/command-reference/remove"><code>dvc remove</code></a>. To remove the data and
its previous versions from the cache, keeping only what the current workspace
references, run the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc gc</span> <span class="token parameter variable">-w</span></span></code></pre></div>
<p>The <code>-w</code> option only keeps the files and directories referenced in the
workspace, so once you have removed the data you don't want to track, this is
how DVC knows what to keep and what to discard.</p>
<p>You can learn more about removing tracked data in
<a href="https://dvc.org/doc/user-guide/how-to/stop-tracking-data" target="_blank" rel="nofollow noopener noreferrer">the docs here</a>.</p>
<h2 id="is-it-normal-for-the-outs-of-a-stage-to-be-removed-when-dvc-repro-is-run" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/993781745524691087" target="_blank" rel="nofollow noopener noreferrer">Is it normal for the <code>outs</code> of a stage to be removed when <code>dvc repro</code> is run?</a><a href="#is-it-normal-for-the-outs-of-a-stage-to-be-removed-when-dvc-repro-is-run" aria-label="is it normal for the outs of a stage to be removed when dvc repro is run permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Fantastic question from @Nish!</p>
<p>This is the expected behavior of DVC. It removes the <code>outs</code> of a stage before
running it again unless the <code>persist: true</code> value is set for that output. You can learn more about how
this works in
<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#output-subfields" target="_blank" rel="nofollow noopener noreferrer">our docs here</a>.
Here's an example of a stage with the <code>persist</code> value set.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">train</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> date <span class="token punctuation">></span> data/external/date
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">data/external</span><span class="token punctuation">:</span>
          <span class="token key atrule">persist</span><span class="token punctuation">:</span> <span class="token boolean important">true</span></code></pre></div>
<p>Even if you don't persist your <code>outs</code>, you can still check out an older version
of the pipeline to get older <code>outs</code> with <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a>. This is based on what's
in the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a> and <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files and it will update your workspace to match
the experiment you check out. This is usually run after checking out a different
Git branch. So the flow might look like:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> experiment-branch
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc checkout</span></span></code></pre></div>
<p>The first command gets the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a> and <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files for the
experiment you want to go back to from your Git history. The second uses DVC to restore
your data to the matching version, so the workspace reflects that experiment. You can
learn more about these details in
<a href="https://dvc.org/doc/command-reference/checkout" target="_blank" rel="nofollow noopener noreferrer">the <code>dvc checkout</code> docs here</a>.</p>
<h2 id="is-there-a-way-to-have-a-plot-with-multiple-y-axes" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/994685566698410055" target="_blank" rel="nofollow noopener noreferrer">Is there a way to have a plot with multiple y-axes?</a><a href="#is-there-a-way-to-have-a-plot-with-multiple-y-axes" aria-label="is there a way to have a plot with multiple y axes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Wonderful question from @shortcipher3!</p>
<p>If you update DVC to version <code>2.12.1</code> or higher, you should be able to define
multiple y-axes in your DVC pipeline. Here's an example of how this may look in
a <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># dvc.yaml</span>
<span class="token key atrule">stages</span><span class="token punctuation">:</span> <span class="token punctuation">...</span>
<span class="token key atrule">plots</span><span class="token punctuation">:</span>
  <span class="token key atrule">some_file.csv</span><span class="token punctuation">:</span>
    <span class="token key atrule">x</span><span class="token punctuation">:</span> x_column_name
    <span class="token key atrule">y</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>col1<span class="token punctuation">,</span> col2<span class="token punctuation">,</span> col3<span class="token punctuation">]</span>
  <span class="token comment"># alternative 1:</span>
  <span class="token key atrule">multiple_rocs</span><span class="token punctuation">:</span>
    <span class="token key atrule">x</span><span class="token punctuation">:</span> x_column_name
    <span class="token key atrule">y</span><span class="token punctuation">:</span>
      <span class="token key atrule">some_file.csv</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>col1<span class="token punctuation">,</span> col2<span class="token punctuation">,</span> col3<span class="token punctuation">]</span>
  <span class="token comment"># in case of multiple files:</span>
  <span class="token key atrule">multiple_rocs_from_multiple_files</span><span class="token punctuation">:</span>
    <span class="token key atrule">x</span><span class="token punctuation">:</span> x_column_name
    <span class="token key atrule">y</span><span class="token punctuation">:</span>
      <span class="token key atrule">file1.csv</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>col1<span class="token punctuation">,</span> col2<span class="token punctuation">]</span>
      <span class="token key atrule">file2.csv</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>col3<span class="token punctuation">]</span></code></pre></div>
<p>A quick note: make sure that <code>plots</code> is on the same level as <code>stages</code> in your
<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file.</p>
<h2 id="how-do-you-structure-the-dvcyaml-file-to-run-in-stages-in-a-specific-order" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/991000853278232616" target="_blank" rel="nofollow noopener noreferrer">How do you structure the <code>dvc.yaml</code> file to run in stages in a specific order?</a><a href="#how-do-you-structure-the-dvcyaml-file-to-run-in-stages-in-a-specific-order" aria-label="how do you structure the dvcyaml file to run in stages in a specific order permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Awesome question from @srb302!</p>
<p>You need to set up outputs and dependencies for each stage. The stage
that runs first generates an output, and the stage that is supposed to run
second uses the first stage's output as a dependency.</p>
<p>Otherwise, DVC does not guarantee any particular execution order for stages
which are independent of each other. DVC determines the structure of your DAG
based on file outputs and dependencies and there isn't another way to enforce
order of stage execution in DVC.</p>
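<p>To illustrate how an execution order falls out of outputs and dependencies, here is a minimal Python sketch. It is not DVC's actual implementation: the stage names (<code>prepare</code>, <code>train</code>, <code>evaluate</code>) and file paths are hypothetical, and the ordering relies on the standard library's <code>graphlib</code> module (Python 3.9+).</p>

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each lists the files it depends on and the files it produces.
stages = {
    "evaluate": {"deps": ["model.pkl", "data/test.csv"], "outs": ["metrics.json"]},
    "train": {"deps": ["data/train.csv"], "outs": ["model.pkl"]},
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/train.csv", "data/test.csv"]},
}


def execution_order(stages):
    """Return an order in which every stage runs after the stages producing its deps."""
    # Map each output file to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage's predecessors are the producers of its dependencies (if any).
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)
```

<p>Because each stage here consumes a file produced by the previous one, the only valid order is <code>prepare</code>, <code>train</code>, <code>evaluate</code>, regardless of how the stages are listed. Stages that share no files would have no fixed order relative to each other.</p>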
<h2 id="how-do-i-know-when-i-should-track-a-file-with-git-or-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/993120910095699978" target="_blank" rel="nofollow noopener noreferrer">How do I know when I should track a file with Git or DVC?</a><a href="#how-do-i-know-when-i-should-track-a-file-with-git-or-dvc" aria-label="how do i know when i should track a file with git or dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This is a really good question from @vadim.sukhov!</p>
<p>Let's take a look at an example <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">evaluate</span><span class="token punctuation">:</span>
    <span class="token punctuation">...</span>
    <span class="token key atrule">plots</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">prc.json</span><span class="token punctuation">:</span>
          <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span>
          <span class="token key atrule">x</span><span class="token punctuation">:</span> recall
          <span class="token key atrule">y</span><span class="token punctuation">:</span> precision
      <span class="token punctuation">-</span> <span class="token key atrule">roc.json</span><span class="token punctuation">:</span>
          <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span>
          <span class="token key atrule">x</span><span class="token punctuation">:</span> fpr
          <span class="token key atrule">y</span><span class="token punctuation">:</span> tpr</code></pre></div>
<p>In this scenario, the <code>prc.json</code> and <code>roc.json</code> files are <strong>not</strong> being tracked
by DVC because of the <code>cache: false</code> value. Since these files aren't tracked by
DVC, they aren't saved to a remote storage location outside of Git, like data
files are. So if you have <code>cache: false</code> on a file that you want to keep track
of, you'll need to commit it to Git in your project.</p>
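<p>The rule of thumb above can be sketched in a few lines of Python. This is only an illustration, not DVC code: the <code>outputs</code> dictionary and the <code>model.pkl</code> entry are hypothetical, and the sketch assumes that an output without an explicit <code>cache</code> flag is cached by DVC.</p>

```python
# Hypothetical outputs of a stage, mirroring the `cache` flags in a dvc.yaml entry.
outputs = {
    "prc.json": {"cache": False},
    "roc.json": {"cache": False},
    "model.pkl": {},  # no flag given; assumed here to behave like cache: true
}


def tracked_by(outputs):
    """Map each output path to the tool expected to store its contents."""
    return {
        path: "dvc" if opts.get("cache", True) else "git"
        for path, opts in outputs.items()
    }


tracking = tracked_by(outputs)
for path, tool in tracking.items():
    print(f"{path}: contents stored via {tool}")
```

<p>In this sketch, the two plot files end up on the Git side, while the hypothetical model file is left to DVC and its remote storage.</p>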
<hr>
<p><img src="https://media.giphy.com/media/pdSncNyYgaH0wqaCqp/giphy.gif" alt="Duck Dynasty GIF by DefyTV"></p>
<p>Keep an eye out for our next Office Hours Meetup! Make sure you stay up to date
with us to find out what it is!
<a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/" target="_blank" rel="nofollow noopener noreferrer">Join our group</a>
to stay up to date with specifics as we get closer to the event!</p>
<p>Check out <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">our docs</a> to get all your DVC, CML, and MLEM
questions answered!</p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to chat with the
community!</p>https://dvc.org/blog/iterative-studio-model-registryhttps://dvc.org/blog/iterative-studio-model-registryTue, 26 Jul 2022 00:00:00 GMT<p>Machine learning tasks are iterative by nature. Over time, you build several
versions of your ML models, which could be in different stages of
production-readiness. A version may be running in production, another version
that seems to perform better may be in staging, and a couple more versions could
be in active development by you and your teammates - using updated
hyperparameters, datasets, or algorithms.</p>
<p>How do you keep track of all your models, their versions, and deployment
statuses? How do you get answers to questions like these easily:</p>
<ul>
<li>Which model version is currently in production?</li>
<li>When was the last time this model was updated?</li>
</ul>
<p>If you are like some of the data scientists we know, you may have a Google sheet
or a Notion page with the list of all your models, their changes, deployment
history, and so on. But this is highly error-prone and will probably get
out-of-date very quickly. Or maybe you upload all your models to a cloud bucket
and “attach” text reports to them. Not very maintainable or searchable either.
We’ve even seen people use sticky notes, or better yet, rely on their memory 😀.</p>
<p>Some of the more organized folks use Model Registries - tools created
specifically to organize models into a central, searchable repository. While
this is definitely better than using random files or sticky notes, one major
problem persists: the data science and machine learning team members work
completely isolated from the software development and DevOps team members. This
makes collaboration far more time consuming than it should be.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/408b5d1f21c83dd2dba8cc35b40238b6/39600/disconnected-silos.png" alt="Teams can work in disconnected silos" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Some even implement in-house systems, and maybe you are also planning to do so.
But these can get expensive to develop and maintain.</p>
<p><strong>We built the Iterative Studio Model Registry to solve these problems.</strong></p>
<p>Iterative Studio Model Registry enables ML teams to collaborate on models by
providing model organization, discovery, versioning, lineage (tracing the origin
of the model), and the ability to manage deployment statuses such as
development, staging, and production across multiple projects.</p>
<h2 id="utilize-your-existing-git-infrastructure" style="position:relative;">Utilize your existing Git infrastructure<a href="#utilize-your-existing-git-infrastructure" aria-label="utilize your existing git infrastructure permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Iterative Studio Model Registry is built on top of Git, which means:</p>
<ul>
<li>You can reuse your existing Git infrastructure to manage ML models together
with code, data, experiment pipelines, and deployment statuses.</li>
<li>You can use GitOps for model deployment, and trigger model deployment from
Iterative Studio, which you can also use to run your ML experiments.</li>
<li>DS/ML folks and Software/DevOps folks can work together more easily, because
they utilize the same tools and infrastructure.</li>
</ul>
<h2 id="open-mlops" style="position:relative;">Open MLOps<a href="#open-mlops" aria-label="open mlops permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>A core philosophy at Iterative is open MLOps - we build tools that work with
your infrastructure. Our toolstack is modular, so you can build your model
registry on top of your existing cloud and DevOps infrastructure.</p>
<p>Under the hood, Iterative Studio Model Registry uses Iterative’s open-source
Git-based tools <a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a> and <a href="https://mlem.ai/" target="_blank" rel="nofollow noopener noreferrer">MLEM</a>.</p>
<ul>
<li><a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a> enables <a href="https://semver.org/" target="_blank" rel="nofollow noopener noreferrer">semantic versioning</a> and stage transitions of artifacts
using metadata files and Git tags.</li>
<li><a href="https://mlem.ai/" target="_blank" rel="nofollow noopener noreferrer">MLEM</a> saves ML models and extracts model metadata including framework,
methods, input / output data schema, and requirements.</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/babf62e08455b02dae8de67684bc7a65/39600/modular-toolstack.png" alt="Iterative toolstack is modular" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h2 id="ui-of-your-choice" style="position:relative;">UI of your choice<a href="#ui-of-your-choice" aria-label="ui of your choice permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Iterative Studio Model Registry meets you where you are, through your favorite
interface. Whether you like APIs, prefer a web interface, or work best in the
command line; whatever your role or preference, we've got you covered so your
team can be most efficient.</p>
<h2 id="models-can-reside-anywhere" style="position:relative;">Models can reside anywhere<a href="#models-can-reside-anywhere" aria-label="models can reside anywhere permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Save your model files wherever works best for you, whether it’s in S3, GCP, or
any other of your remote (or local) storages. Then, add them to the model
registry in a non-intrusive, no-code fashion <strong>without modifying your ML
training code</strong>. This saves you hours of valuable time.</p>
<h2 id="collaborate-across-multiple-projects" style="position:relative;">Collaborate across multiple projects<a href="#collaborate-across-multiple-projects" aria-label="collaborate across multiple projects permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>A central dashboard of all your models facilitates transparency and discovery
across every project by your whole team.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d998d5277de4d7bf4f64506384f7c134/39600/models-dashboard.png" alt="Models are organized in a central dashboard" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>And on the model details page, you’ll find that information about the model is
automatically detected and its history tracked.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/621a6f181d9bee2ab9071b9be4f845df/39600/model-details-page.png" alt="All models have separate model detail pages" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<admon type="tip">
<p>Try our <a href="https://studio.datachain.ai/team/Iterative/models" target="_blank" rel="nofollow noopener noreferrer">demo Model Registry</a>
to get a feel for Iterative Studio's Model Registry features.</p>
</admon>
<h2 id="create-model-versions-and-stages-from-any-git-commit" style="position:relative;">Create model versions and stages from any Git commit<a href="#create-model-versions-and-stages-from-any-git-commit" aria-label="create model versions and stages from any git commit permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>For registering versions, select the commit and provide the version number. To
assign stages, select the version and provide the stage name. It is as simple as
that.</p>
<h2 id="git-remains-the-single-source-of-truth-for-all-your-ml-projects" style="position:relative;">Git remains the single source of truth for all your ML projects<a href="#git-remains-the-single-source-of-truth-for-all-your-ml-projects" aria-label="git remains the single source of truth for all your ml projects permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Here’s a brief explanation of how the model, version and stage information is
stored in Git:</p>
<ul>
<li>The following entry in <code>artifacts.yaml</code> registers your <code>image-classifier-model</code>
model and records its description, labels, path, and type.</li>
</ul>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">image-classifier-model</span><span class="token punctuation">:</span>
  <span class="token key atrule">description</span><span class="token punctuation">:</span>
    This model is used to classify images of different objects submitted by
    users. This version of the model has an accuracy of 95%.
  <span class="token key atrule">labels</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> Random Forest
    <span class="token punctuation">-</span> image classification
    <span class="token punctuation">-</span> sklearn
  <span class="token key atrule">path</span><span class="token punctuation">:</span> .mlem/model/image<span class="token punctuation">-</span>classifier<span class="token punctuation">-</span>model
  <span class="token key atrule">type</span><span class="token punctuation">:</span> model</code></pre></div>
<p>In the following example, the Git tag <code>image-classifier-model@v2.0.0</code> indicates
that you created version <code>2.0.0</code> of your <code>image-classifier-model</code> from the Git
commit <code>6c0fc85</code>.</p>
<p>The Git tag <code>image-classifier-model#production#3</code> indicates that you assigned
the <code>production</code> stage to version <code>2.0.0</code> of your model.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 394px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b8b68153a471f89bb3276775b475e26d/39600/git-tags.png" alt="Git tags represent model version and stage" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
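<p>As a rough sketch of how such tags can be read back, the Python snippet below splits a tag name into its parts. It is not GTO's own parser. The tag shapes are assumptions: a version tag is taken to look like <code>name@vX.Y.Z</code> and a stage tag like <code>name#stage#N</code>, and the helper <code>parse_tag</code> is hypothetical.</p>

```python
import re

# Assumed tag shapes (not taken from GTO's source code):
#   version tag: <model>@v<major>.<minor>.<patch>
#   stage tag:   <model>#<stage>#<counter>
VERSION_TAG = re.compile(r"^(?P<model>[^@#]+)@v(?P<version>\d+\.\d+\.\d+)$")
STAGE_TAG = re.compile(r"^(?P<model>[^@#]+)#(?P<stage>[^#]+)#(?P<counter>\d+)$")


def parse_tag(tag):
    """Classify a Git tag as a version registration or a stage assignment."""
    match = VERSION_TAG.match(tag)
    if match:
        return {"kind": "version", **match.groupdict()}
    match = STAGE_TAG.match(tag)
    if match:
        return {"kind": "stage", **match.groupdict()}
    return None  # not a registry tag under the assumed shapes


version_info = parse_tag("image-classifier-model@v2.0.0")
stage_info = parse_tag("image-classifier-model#production#3")
print(version_info)
print(stage_info)
```

<p>Because everything is encoded in the tag name and the commit it points to, any tool that can list Git tags can answer questions such as which version is in production.</p>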
<h2 id="a-single-platform-for-all-your-mlops-needs" style="position:relative;">A single platform for all your MLOps needs<a href="#a-single-platform-for-all-your-mlops-needs" aria-label="a single platform for all your mlops needs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Since its inception, Iterative Studio has brought together <a href="https://git-scm.com/" target="_blank" rel="nofollow noopener noreferrer">Git</a>, <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, and
<a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a> for seamless data and model management, experiment tracking, visualization
and automation. Now, by harnessing the power of <a href="https://mlem.ai/" target="_blank" rel="nofollow noopener noreferrer">MLEM</a> and <a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a> in its Model
Registry, it makes your machine learning processes even more robust.</p>
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>With the Iterative Studio Model Registry, your ML model (dis)organization is not
in chaos anymore. Collaborating on your ML projects becomes faster and your ML
team members’ lives become much easier.</p>
<p>Start using <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio Model Registry</a> today. And answer all the who,
what, why, where and when questions of your team's model production directly
from the information in your Git repository.</p>
<p>Refer to the <a href="https://dvc.org/doc/studio/user-guide/model-registry" target="_blank" rel="nofollow noopener noreferrer">documentation and tutorials</a> to get started. To request
support or share feedback, you can <a href="mailto:[email protected]" target="_blank" rel="nofollow noopener noreferrer">email me</a> or create a support ticket on
<a href="https://github.com/iterative/studio-support" target="_blank" rel="nofollow noopener noreferrer">GitHub</a>.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/DYeVI-QrHGI?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>https://dvc.org/blog/serving-models-with-mlemhttps://dvc.org/blog/serving-models-with-mlemTue, 19 Jul 2022 00:00:00 GMT<p>Training a machine learning model is only one step in the process of getting
something useful out to end-users. When it's time to deploy the model to
production, there are a number of approaches you can take depending on the goal
of the machine learning project. That might mean getting the model ready to
respond to real-time queries coming from an API or batch processing predictions,
for example.</p>
<p>Either way, you'll need to save your trained and validated model in a format
that's consumable by other systems. That's why we'll be covering how to serve
models through a <a href="https://restfulapi.net/" target="_blank" rel="nofollow noopener noreferrer">REST</a> endpoint or a Python package
with <a href="https://mlem.ai/" target="_blank" rel="nofollow noopener noreferrer">MLEM</a>.</p>
<blockquote>
<p>You can get the repo we're working with
<a href="https://github.com/iterative/stale-model-example/tree/mlem-serve" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
</blockquote>
<h2 id="take-a-candidate-model" style="position:relative;">Take a candidate model<a href="#take-a-candidate-model" aria-label="take a candidate model permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>There are instructions in the project
<a href="https://github.com/iterative/stale-model-example/tree/mlem-serve#readme" target="_blank" rel="nofollow noopener noreferrer">README</a>
on how to get everything you need installed and running. This is a simple ML
project that uses <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a> for data versioning and experiment
tracking.</p>
<p>After you have the repo set up, you'll already have the <code>mlem</code> package
installed. This project already has a model that's been trained and validated so
we can move on to saving this model.</p>
<h2 id="save-the-model" style="position:relative;">Save the model<a href="#save-the-model" aria-label="save the model permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Inside the <code>train.py</code> script, we need to add the <code>mlem</code> import to save the
models as we experiment. We don't have to worry about running the training
script for this project since we have the model, but it's good to know what's
happening under the hood.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token comment"># train.py</span>
<span class="token keyword">import</span> os
<span class="token keyword">import</span> pickle5 <span class="token keyword">as</span> pickle
<span class="token keyword">import</span> sys
<span class="token keyword">import</span> yaml
<span class="token keyword">from</span> mlem<span class="token punctuation">.</span>api <span class="token keyword">import</span> save
<span class="token keyword">import</span> numpy <span class="token keyword">as</span> np
<span class="token keyword">from</span> sklearn<span class="token punctuation">.</span>ensemble <span class="token keyword">import</span> RandomForestClassifier
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></code></pre></div>
<p>Then you can add the <code>save</code> function to the end of the training script.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token comment"># train.py</span>
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
clf <span class="token operator">=</span> RandomForestClassifier<span class="token punctuation">(</span>
    n_estimators<span class="token operator">=</span>n_est<span class="token punctuation">,</span> min_samples_split<span class="token operator">=</span>min_split<span class="token punctuation">,</span> n_jobs<span class="token operator">=</span><span class="token number">2</span><span class="token punctuation">,</span> random_state<span class="token operator">=</span>seed
<span class="token punctuation">)</span>
clf<span class="token punctuation">.</span>fit<span class="token punctuation">(</span>x<span class="token punctuation">,</span> labels<span class="token punctuation">)</span>
save<span class="token punctuation">(</span>
    clf<span class="token punctuation">,</span>
    <span class="token string">"clf"</span><span class="token punctuation">,</span>
    sample_data<span class="token operator">=</span>x<span class="token punctuation">,</span>
    description<span class="token operator">=</span><span class="token string">"Random Forest Classifier"</span><span class="token punctuation">,</span>
<span class="token punctuation">)</span></code></pre></div>
<admon type="tip">
<p>You don't have to do these steps as we already have a model available, but if
you want to see the training and evaluation steps in action, you can reproduce the
DVC pipeline with:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span></span></code></pre></div>
<p>You can check out what is happening in that pipeline by looking at the
<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file in the project.</p>
<p>You can also see where we load the model in the <code>src/evaluate.py</code> script. To
do that, you'll need to add the following import.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token comment"># evaluate.py</span>
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
<span class="token keyword">import</span> pickle5 <span class="token keyword">as</span> pickle
<span class="token keyword">import</span> sklearn<span class="token punctuation">.</span>metrics <span class="token keyword">as</span> metrics
<span class="token keyword">from</span> mlem<span class="token punctuation">.</span>api <span class="token keyword">import</span> <span class="token builtin">apply</span>
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></code></pre></div>
<p>Now we can use the <a href="https://mlem.ai/doc/api-reference/apply" target="_blank" rel="nofollow noopener noreferrer"><code>apply</code> function</a>
to make predictions with the model.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token comment"># evaluate.py</span>
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
x <span class="token operator">=</span> matrix<span class="token punctuation">.</span>iloc<span class="token punctuation">[</span><span class="token punctuation">:</span><span class="token punctuation">,</span><span class="token number">1</span><span class="token punctuation">:</span><span class="token number">11</span><span class="token punctuation">]</span><span class="token punctuation">.</span>values
cleaned_x <span class="token operator">=</span> np<span class="token punctuation">.</span>where<span class="token punctuation">(</span>np<span class="token punctuation">.</span>isnan<span class="token punctuation">(</span>x<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">,</span> x<span class="token punctuation">)</span>
labels_pred <span class="token operator">=</span> <span class="token builtin">apply</span><span class="token punctuation">(</span>model_file<span class="token punctuation">,</span> cleaned_x<span class="token punctuation">,</span> method<span class="token operator">=</span><span class="token string">"predict"</span><span class="token punctuation">)</span>
predictions_by_class <span class="token operator">=</span> <span class="token builtin">apply</span><span class="token punctuation">(</span>model_file<span class="token punctuation">,</span> cleaned_x<span class="token punctuation">,</span> method<span class="token operator">=</span><span class="token string">"predict_proba"</span><span class="token punctuation">)</span>
predictions <span class="token operator">=</span> predictions_by_class<span class="token punctuation">[</span><span class="token punctuation">:</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">]</span>
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></code></pre></div>
<p><code>predict</code> and <code>predict_proba</code> are methods available on the model. They
are used to get the predicted labels and the per-class probabilities for evaluation.
This, along with everything else in the script, is how we get the metrics for
each experiment.</p>
</admon>
<p>After you run an experiment, there will be two new files in your repo: <code>clf</code> and
<code>clf.mlem</code>. Make sure you add the <code>clf.mlem</code> file to your Git history with the
following command:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> clf.mlem</span></code></pre></div>
<p>This is so that the metadata is in your repo and ready to use with other MLEM
commands. Now we can finally take the model file and ship it to production!</p>
<h2 id="deploy-the-model-to-production" style="position:relative;">Deploy the model to production<a href="#deploy-the-model-to-production" aria-label="deploy the model to production permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>There are a couple of ways you can do this with MLEM:</p>
<ul>
<li>Serve the model with <a href="https://fastapi.tiangolo.com/" target="_blank" rel="nofollow noopener noreferrer">FastAPI</a>.</li>
<li>Create a Python package (and use or distribute it).</li>
</ul>
<p><em>Note:</em> There is also an option to
<a href="https://mlem.ai/doc/get-started/deploying" target="_blank" rel="nofollow noopener noreferrer">deploy the model directly to Heroku</a>,
although this functionality is experimental and may have breaking changes.</p>
<h3 id="serve-with-fastapi" style="position:relative;">Serve with FastAPI<a href="#serve-with-fastapi" aria-label="serve with fastapi permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If you don't have an API to work with and don't need a Python package, like if
you're just testing a model, you can serve your model quickly using FastAPI with
this command.</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token mlem">mlem serve</span> clf fastapi</span></code></pre></div>
<p>This will run a local server and spin up a web API for you so you can quickly
test out your model without needing a development team to work on the API
initially.</p>
<p>You'll see an output like this in your terminal:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token mlem">mlem serve</span> clf fastapi
</span>⏳️ Loading model from clf.mlem
Starting fastapi server...
💅 Adding route for /predict
💅 Adding route for /predict_proba
💅 Adding route for /sklearn_predict
💅 Adding route for /sklearn_predict_proba
Checkout openapi docs at &lt;http://0.0.0.0:8080/docs&gt;
INFO: Started server process [31916]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)</code></pre></div>
<p>Then, when you go to the local URL, you'll see the
<a href="https://fastapi.tiangolo.com/features/#automatic-docs" target="_blank" rel="nofollow noopener noreferrer">documentation</a> for how
to use the model you created.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7289114fcbec3359f45be08e125104f2/39600/fastapi_docs.png" alt="FastAPI ML model deployment" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
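<p>Once the server is up, you can also call it from code. The sketch below builds a
<code>POST</code> request for the <code>/predict</code> route listed in the output above. The
<code>localhost</code> address, the JSON envelope, and the feature names are assumptions for
illustration only; copy the real request schema from the <code>/docs</code> page for your model.</p>

```python
import json
import urllib.request

# Port taken from the server output above; localhost is assumed to reach
# the 0.0.0.0 bind address on the same machine.
BASE_URL = "http://localhost:8080"


def build_predict_request(rows, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /predict route."""
    # ASSUMPTION: this envelope is a placeholder. The real schema for your
    # model is shown in the OpenAPI docs served at /docs.
    body = json.dumps({"data": {"values": rows}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical feature row; real column names come from your training data.
request = build_predict_request([{"feature_a": 1.0, "feature_b": 2.0}])
print(request.full_url, request.get_method())

# With the server running, send the request like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```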
<p>That's it! Now you know how to train a model, save it, and deploy to some
external service quickly using MLEM!</p>
<h3 id="custom-python-package" style="position:relative;">Custom Python package<a href="#custom-python-package" aria-label="custom python package permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Let's take a look at making a Python package and importing it into a
<a href="https://flask.palletsprojects.com/en/2.1.x/" target="_blank" rel="nofollow noopener noreferrer">Flask</a> web app. To make the Python
package, we'll run the following MLEM command.</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token mlem">mlem build</span> clf pip <span class="token parameter variable">-c</span> <span class="token assign-left variable">target</span><span class="token operator">=</span>build/ <span class="token parameter variable">-c</span> <span class="token assign-left variable">package_name</span><span class="token operator">=</span>bike_predictor</span></code></pre></div>
<p>This takes our <code>clf.mlem</code> file and generates a Python package called
<code>bike_predictor</code> in the <code>build</code> directory. When you look in your project, you
should see that new <code>build</code> folder that has all of the files you need for an
independent Python package.</p>
<p>To build the package, you'll need to run the following command in the <code>build</code>
directory.</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">python</span> <span class="token parameter variable">-m</span> build <span class="token parameter variable">--wheel</span></span></code></pre></div>
<p>Then go back to the top-level directory and run the following command to install
your new model package.</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> ./build/dist/bike_predictor-0.0.0-py3-none-any.whl</span></code></pre></div>
<p>Now you can import this to your Flask API like so.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token comment"># api.py</span>
<span class="token keyword">import</span> os
<span class="token keyword">from</span> flask <span class="token keyword">import</span> Flask<span class="token punctuation">,</span> jsonify<span class="token punctuation">,</span> request
<span class="token keyword">from</span> flask_sqlalchemy <span class="token keyword">import</span> SQLAlchemy
<span class="token keyword">from</span> flask_migrate <span class="token keyword">import</span> Migrate
<span class="token keyword">from</span> flask_cors <span class="token keyword">import</span> CORS
<span class="token keyword">from</span> dotenv <span class="token keyword">import</span> load_dotenv
<span class="token keyword">import</span> bike_predictor
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></code></pre></div>
<p>You can then use the <code>predict</code> method on new data and run any other tasks you
need to in the API.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token comment"># api.py</span>
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
new_event <span class="token operator">=</span> EventsModel<span class="token punctuation">(</span>
title<span class="token operator">=</span>data<span class="token punctuation">[</span><span class="token string">"title"</span><span class="token punctuation">]</span><span class="token punctuation">,</span>
date<span class="token operator">=</span>data<span class="token punctuation">[</span><span class="token string">"date"</span><span class="token punctuation">]</span><span class="token punctuation">,</span>
time<span class="token operator">=</span>data<span class="token punctuation">[</span><span class="token string">"time"</span><span class="token punctuation">]</span><span class="token punctuation">,</span>
location<span class="token operator">=</span>data<span class="token punctuation">[</span><span class="token string">"location"</span><span class="token punctuation">]</span><span class="token punctuation">,</span>
description<span class="token operator">=</span>data<span class="token punctuation">[</span><span class="token string">"description"</span><span class="token punctuation">]</span><span class="token punctuation">,</span>
<span class="token punctuation">)</span>
db<span class="token punctuation">.</span>session<span class="token punctuation">.</span>add<span class="token punctuation">(</span>new_event<span class="token punctuation">)</span>
db<span class="token punctuation">.</span>session<span class="token punctuation">.</span>commit<span class="token punctuation">(</span><span class="token punctuation">)</span>
bike_predictor<span class="token punctuation">.</span>predict<span class="token punctuation">(</span>new_event<span class="token punctuation">)</span>
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></code></pre></div>
<p>Then you can test this API out locally by running the following command:</p>
<div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">python</span> src/api.py</span></code></pre></div>
<p>This will start up a local server on port 5000 and you'll be able to see your
model in action. From here, this can be deployed to any cloud environment as
long as you remember to include and install the model package.</p>
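<p>If you are starting from scratch rather than adding to an existing API, a prediction
route can be very small. The sketch below is one possible layout, not the app from this post:
the <code>rows</code> payload key, the <code>/predict</code> route, and the <code>fake_predict</code>
stand-in are all hypothetical. In a real app you would pass your installed package's
<code>predict</code> function instead, with whatever input type your model expects.</p>

```python
def predict_from_payload(payload, predict_fn):
    """Validate a JSON payload and return (response body, HTTP status)."""
    rows = payload.get("rows") if isinstance(payload, dict) else None
    if not isinstance(rows, list) or not rows:
        return {"error": "expected a non-empty 'rows' list"}, 400
    return {"predictions": list(predict_fn(rows))}, 200


def create_app(predict_fn):
    """Wrap the helper above in a minimal Flask app."""
    # Flask is imported lazily so the helper above can be used without it.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        # silent=True returns None instead of raising on a bad JSON body.
        payload = request.get_json(silent=True)
        body, status = predict_from_payload(payload, predict_fn)
        return jsonify(body), status

    return app


def fake_predict(rows):
    """Stand-in for the packaged model's predict function (hypothetical)."""
    return [0 for _ in rows]


body, status = predict_from_payload({"rows": [{"temp": 0.3}]}, fake_predict)
print(status, body)

# To serve it locally on port 5000:
# create_app(fake_predict).run(port=5000)
```

<p>Keeping the validation in a plain function means it can be unit tested without starting
a server.</p>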
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In this post, we learned how easy it can be to deploy a model through FastAPI or
through a Python package with MLEM. You can use this same process to train and
serve any model through an API endpoint very quickly. This can help with
validation, collaborating with team members, and it can help you see if there
are any underlying issues in your overall deployment process before you hear
about them from users. MLEM can also be used to create a model registry so you
can store and switch between models whenever you need to.</p>https://dvc.org/blog/july-heartbeathttps://dvc.org/blog/july-heartbeatMon, 18 Jul 2022 00:00:00 GMT<details>
<summary>✨Image Inspo✨</summary>
<p>This month our cover image is inspired by a Community member
<a href="https://twitter.com/GiftOjeabulu_" target="_blank" rel="nofollow noopener noreferrer">Gift Ojeabulu</a>. Gift is a champion of
Community and is a leader in the data movement in Nigeria. Recently he presented
about DVC at the
<a href="https://twitter.com/GiftOjeabulu_" target="_blank" rel="nofollow noopener noreferrer">Open Source Africa Conference</a>. He is also
deeply involved in building the data Community in Africa
through <a href="https://datafestafrica.com/" target="_blank" rel="nofollow noopener noreferrer">Data Fest Africa</a>. We are lucky to have a
Gift as a member of our own Community.</p>
</details>
<h1 id="first-an-apology" style="position:relative;">First an apology<a href="#first-an-apology" aria-label="first an apology permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>I first must share my sincere apologies. With all that was going on in the
Iterative Community last month, I ran out of time to finish the June Heartbeat.
With even more time passing there's lots to write about; let's do this!</p>
<p><img src="https://media.giphy.com/media/CzbiCJTYOzHTW/giphy.gif" alt="Send Tom Hanks GIF"></p>
<h2 id="mlem-release" style="position:relative;">MLEM Release<a href="#mlem-release" aria-label="mlem release permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>On June 1st we released our latest open source tool in the Iterative ecosystem.
MLEM is a model registry and deployment tool connected to your Git repo.<br>
Together with <a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a> and <a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a>
(Git Tag Ops), MLEM helps you maintain a model registry right in your git
repository. Now we have one more step in the process of fully syncing together
the software development and machine learning worlds. To learn more about MLEM,
visit <a href="https://mlem.ai" target="_blank" rel="nofollow noopener noreferrer">the website</a>,
<a href="https://github.com/iterative/mlem" target="_blank" rel="nofollow noopener noreferrer">⭐️ the repository</a>,
<a href="https://dvc.org/blog/DVC-VS-Code-extension" target="_blank" rel="nofollow noopener noreferrer">read the blog post</a>, or
<a href="https://youtu.be/a2Lc9kEgEM8" target="_blank" rel="nofollow noopener noreferrer">watch the video</a> of
<a href="https://github.com/mike0sv" target="_blank" rel="nofollow noopener noreferrer"><strong>Mike Sveshnikov's</strong></a> full presentation and demo on
MLEM at our Release Party.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/a2Lc9kEgEM8?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>If pressed for time you can also catch a shorter version of the presentation
with <a href="https://www.linkedin.com/in/agrigorev/" target="_blank" rel="nofollow noopener noreferrer">Alexey Grigorev</a> of
<a href="https://datatalks.club/" target="_blank" rel="nofollow noopener noreferrer">Data Talks Club</a>
<a href="https://www.youtube.com/watch?v=QQZUy0kSzOk" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<h2 id="mlops-world-2022" style="position:relative;">MLOps World 2022<a href="#mlops-world-2022" aria-label="mlops world 2022 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>I started writing this Heartbeat on the plane heading back from
<a href="https://mlopsworld.com/" target="_blank" rel="nofollow noopener noreferrer">MLOps World</a> in Toronto. This conference was a real
treat! It was wonderful to meet so many Community members already using DVC and
also to see conference talks advocating for our tools that we didn't even know
were going to happen! Many thanks to <a href="https://www.interos.ai/" target="_blank" rel="nofollow noopener noreferrer">Interos'</a>
<a href="https://www.linkedin.com/in/stephanrb3/" target="_blank" rel="nofollow noopener noreferrer"><strong>Stephen Brown</strong></a> and
<a href="https://www.linkedin.com/in/amybachir/" target="_blank" rel="nofollow noopener noreferrer"><strong>Amy Bachir</strong></a> for sharing about DVC
and CML in the talk, <em>A GitOps Approach to Machine Learning.</em></p>
<p>Additionally, it was great to finally meet in person all the people from the
greater MLOps Community that I'd previously only known virtually including
<a href="https://www.linkedin.com/in/dpbrinkm/" target="_blank" rel="nofollow noopener noreferrer"><strong>Demetrios Brinkmann</strong></a> of
<a href="https://mlops.community/" target="_blank" rel="nofollow noopener noreferrer">MLOps Community Slack</a>, our friends from
<a href="https://dagshub.com/" target="_blank" rel="nofollow noopener noreferrer">DAGsHub</a>, and <a href="https://tryolabs.com/" target="_blank" rel="nofollow noopener noreferrer">Tryolabs</a>, and one
of our Community Champions
<a href="https://www.linkedin.com/in/sami-jawhar-a58b9849/" target="_blank" rel="nofollow noopener noreferrer"><strong>Sami Jawhar</strong></a> who
presented at one of our most engaging meetups on record, asking the question
<em>What IS an experiment?</em> You can find this great talk below.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/DxZdWq3Weng?rel=0&%3B=&%3Bshowinfo=0%3B&start=1309" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>The conference talks were great. I was able to attend three:</p>
<ul>
<li><em>Top 5 Lessons Learned in Helping Organizations Adopt MLOps practices</em> from
<a href="https://www.linkedin.com/in/shelbee-eigenbrode/" target="_blank" rel="nofollow noopener noreferrer"><strong>Shelbee Eigenbrode</strong></a>,
Principal AI/ML Specialist</li>
<li><em>Panel: What Every Product Manager Delivering AI Solutions Should Know</em>,
moderated by
<a href="https://www.linkedin.com/in/jessie-lamontagne-89b2a912b/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jessie Lamontagne</strong></a>
(who was lucky enough to take home her very own DeeVee, see below), Data
Science Manager at Kinaxis; with
<a href="https://www.linkedin.com/in/nahlags/" target="_blank" rel="nofollow noopener noreferrer"><strong>Nahla Salem</strong></a>, Senior Product
Manager at <a href="https://www.yelp.com/" target="_blank" rel="nofollow noopener noreferrer">Yelp</a>;
<a href="https://www.linkedin.com/in/anneya-golob/" target="_blank" rel="nofollow noopener noreferrer"><strong>Anneya Golob</strong></a>, Staff Data
Scientist at <a href="https://www.shopify.com/" target="_blank" rel="nofollow noopener noreferrer">Shopify</a>, and
<a href="https://www.linkedin.com/in/phillipgornicki/" target="_blank" rel="nofollow noopener noreferrer"><strong>Phillip Gorniki</strong></a>, Sr.
Product Manager at <a href="https://www.kinaxis.com/en" target="_blank" rel="nofollow noopener noreferrer">Kinaxis</a>. A quote from Nahla
that stood out for me from this panel was, "If everything is
a priority, nothing is a priority." That was a lesson I needed to take to
heart, hence a bumped Heartbeat. 😢</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f2fe15e9b439f17b5ad8ac2d6aa406c9/39600/jessie-lamontagne.png" alt="Jessie Lamontagne" title="Jessie Lamontagne" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Jessie Lamontagne of Kinaxis with DeeVee!
(<a href="https://www.linkedin.com/in/jessie-lamontagne-89b2a912b/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<p>I heard great feedback from attendees on conference talks as well. In general,
the atmosphere at the conference had a fantastic, positive vibe with great
connections made through the event app, the conference itself, and parties and
networking opportunities 🥳🍻 We also thoroughly enjoyed being Expo Booth
neighbors with <a href="https://www.seldon.io/" target="_blank" rel="nofollow noopener noreferrer">Seldon</a> (model serving) and
<a href="https://www.genesiscloud.com/" target="_blank" rel="nofollow noopener noreferrer">Genesis Cloud</a> (environmentally sustainable
GPUs!) I must finally give hats off to the organizers
<a href="https://www.linkedin.com/in/farazthambi/" target="_blank" rel="nofollow noopener noreferrer"><strong>Faraz Thambi</strong></a> and
<a href="https://www.linkedin.com/in/tinaaprile/" target="_blank" rel="nofollow noopener noreferrer"><strong>Tina Aprile</strong></a>, who delivered an
extremely well thought out and run, in-person Conference! If you didn't attend
this year, you should definitely put it on your radar for next, or attend their
<a href="https://www.torontomachinelearning.com/" target="_blank" rel="nofollow noopener noreferrer">Toronto Machine Learning Summit</a> in
November! Plus Toronto was fun! Check out our team dinner the last night from
the CN Tower.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/89437ee946b86dc731eb573ef5ca24e3/03346/team-toronto.jpg" alt="Team Dinner at the CN Tower" title="Team Dinner at the CN Tower" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Team dinner at the CN Tower - Pictured L to R: Gabriella Caraballo, Stephanie
Roy, Mike Moynihan, Jorge Orpinel Perez (forward), me, Mikhail Sveshnikov,
Milecia McGregor (forward), Max Aginsky, Alex Kim (forward), and Dmitry Petrov</em></p>
<h2 id="dvc-extension-for-vs-code" style="position:relative;">DVC Extension for VS Code<a href="#dvc-extension-for-vs-code" aria-label="dvc extension for vs code permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We just released our DVC extension for VS Code! It was so fun to let the cat out
of the bag to conference goers and watch their eyes light up! 😃 This was a
foreshadowing of events to come at the release! It hadn't been a complete
secret, given
<a href="https://twitter.com/DynamicWebPaige/status/1430920240251035649" target="_blank" rel="nofollow noopener noreferrer">Paige Bailey's tweet</a>
about it a while ago and the fact that it's been on the
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">VS Code Marketplace</a>
for a couple of months so beta testers could try it out, but we did finally,
officially release the tool on June 12th.</p>
<p>And OH. MY. GOSH. The response has been amazing! Over 3,400 people have already
watched the video below on YouTube, and 1,000 more new users downloaded the
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC Extension for VS Code</a>
from the marketplace, just within the first couple of days!</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/LHi3SWGD9nc?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>You will find in this extension:</p>
<ul>
<li>tons of experiment tracking and table functionality over your regular CLI</li>
<li>live metrics tracking</li>
<li>the ability to run and queue experiments directly from the experiment table or
the command tree</li>
<li>sorting, drag and drop column and group movement</li>
<li>expanded plot viewing capabilities - zoom into plots and save them as PNGs or
SVGs for your reporting needs</li>
</ul>
<p>If you are a DVC and VS Code user, you will be a happy camper! Please try it and
as always reach out with feedback! We want to make these tools better for you!</p>
<p>Since the release, <a href="https://twitter.com/alex000kim" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim</strong></a> talked with
<a href="https://twitter.com/ReynaldAdolphe" target="_blank" rel="nofollow noopener noreferrer"><strong>Reynald Adolphe</strong></a> on the VS Code
Livestream and showed off the tool. You can check that out here! 👇🏽</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/Eq3100S3aHw?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="content-from-the-community" style="position:relative;">Content from the Community<a href="#content-from-the-community" aria-label="content from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>There's been lots of juicy content from the Community
<a href="https://dvc.org/blog/may-22-heartbeat" target="_blank" rel="nofollow noopener noreferrer">since the last Heartbeat</a>. When I first
started at Iterative over a year and a half ago, I would hope each month that
there would be enough content from the Community to write about. This is no
longer an issue; I sadly have to filter it now, so that these Heartbeats don't
go on for days. If you've written something about our tools and it hasn't
appeared in a Heartbeat, just know that we see it and we are grateful for all
the Community's efforts to share about our tools! 🙏🏼</p>
<h3 id="alex-strick-van-linschoten---more-data-more-problems-using-dvc-to-handle-data-versioning-for-a-computer-vision-problem" style="position:relative;"><strong>Alex Strick van Linschoten</strong> - More Data, More Problems: Using DVC to handle data versioning for a computer vision problem<a href="#alex-strick-van-linschoten---more-data-more-problems-using-dvc-to-handle-data-versioning-for-a-computer-vision-problem" aria-label="alex strick van linschoten more data more problems using dvc to handle data versioning for a computer vision problem permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://www.linkedin.com/in/strickvl/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Strick van Linschoten</strong></a> brings
us
<a href="https://mlops.systems/tools/redactionmodel/computervision/mlops/2022/05/24/data-versioning-dvc.html#-appendix-how-to-switch-from-git-lfs-to-dvc" target="_blank" rel="nofollow noopener noreferrer">this great overview of DVC's versioning capabilities</a>
on his use of DVC in a redaction identifier project. He goes through the pluses
of using DVC, which he describes as "be(ing) more or less unchallenged for what it
does in the data versioning domain." He had previously used Git LFS and found it
to be less robust, so he made the switch to DVC. In his post, he provides a
<a href="https://mlops.systems/tools/redactionmodel/computervision/mlops/2022/05/24/data-versioning-dvc.html#-appendix-how-to-switch-from-git-lfs-to-dvc:~:text=I%E2%80%99m%20missing%20out%E2%80%A6-,%F0%9F%8F%83%20Appendix%3A%20How%20to%20switch%20from%20git%2Dlfs%20to%20DVC,-When%20I%20first" target="_blank" rel="nofollow noopener noreferrer">tutorial on making the switch from Git LFS to DVC</a>.
We are so grateful to Alex for sharing this guide with the Community!</p>
<p>Also super worthy of mention is Alex's shout-out about our welcoming Community.
We are thankful for this praise and for his contributions to our Community. 🙏🏼</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c213851c35adb7755dbf690baca5f00d/39600/alex-strick-van-linshoten.png" alt="Iterative Community shout out from Alex Strick van Linschoten" title="Iterative Community shout out from Alex Strick van Linschoten" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Thanks for the shout-out Alex!
(<a href="https://mlops.systems/tools/redactionmodel/computervision/mlops/2022/05/24/data-versioning-dvc.html#-appendix-how-to-switch-from-git-lfs-to-dvc" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h3 id="mymlops-stack" style="position:relative;">MyMLOps Stack<a href="#mymlops-stack" aria-label="mymlops stack permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://mymlops.com/" target="_blank" rel="nofollow noopener noreferrer">MyMLOps.com</a> provides a tool to help you build a cool
diagram for your MLOps Stack. There's no about page there or indication of who
made this for the greater MLOps Community, which is frankly a bit sus.
Nevertheless, we were excited to see DVC included in the Experiment Tracking
section, as it should be! We know there are other great experiment tracking tools
out there, and we are glad to see that the larger Community is starting to
recognize this capability in DVC! We like to think of it as taking a step
beyond tracking to versioning. To learn more about experiment versioning,
<a href="https://dvc.org/blog/ml-experiment-versioning" target="_blank" rel="nofollow noopener noreferrer">visit this blog piece</a> from
Technical Product Manager - DVC,
<a href="https://www.linkedin.com/in/david-berenbaum-20b6b424/" target="_blank" rel="nofollow noopener noreferrer">Dave Berenbaum</a>.</p>
<p>Our team had an internal discussion about the absence of our tools from certain
categories: DVC and CML for artifact tracking, CML for Pipeline Orchestration
Runtime Engine, and MLEM for Model Registry and Serving. But like everything in this
space, things are changing constantly. Thanks to whoever you are out there that
made this nifty tool!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/594a223e585365459e0234eed352f138/39600/mymlops.png" alt="MyMLOps.com" title="MyMLOps.com" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>MLOps tool
stack diagram generator from MyMLOps.com (<a href="https://mymlops.com/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h3 id="samson-zhang-mlops-how-dvc-smartly-manages-your-data-sets-for-training-your-machine-learning-models-on-top-of-git" style="position:relative;"><strong>Samson Zhang</strong>: MLOps: How DVC smartly manages your data sets for training your machine learning models on top of Git<a href="#samson-zhang-mlops-how-dvc-smartly-manages-your-data-sets-for-training-your-machine-learning-models-on-top-of-git" aria-label="samson zhang mlops how dvc smartly manages your data sets for training your machine learning models on top of git permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://www.linkedin.com/in/samson-zhang-887135115/" target="_blank" rel="nofollow noopener noreferrer"><strong>Samson Zhang</strong></a> of
<a href="https://www.littlebigcode.fr">LittleBigCode</a> wrote an in-depth article on
<a href="https://medium.com" target="_blank" rel="nofollow noopener noreferrer">Medium</a> about how DVC manages large datasets. He
discusses why DVC is needed and why he considers it a better option than MLflow,
which does not deduplicate stored files the way DVC does, and than Git LFS, for
the same reasons Alex Strick van Linschoten gives in
the piece mentioned above. Samson gives a thorough overview of the
tool, how it works, and how to use it. He includes some best practices that he
has worked out while using the tool and shows how to set up a dataset
registry, which he finds particularly useful with DVC.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4f841c2fa3b8b4802ca2368d9697b1df/39600/samson-zhang.png" alt="Samson Zhang, DVC Workflow, Cache and Storage" title="Samson Zhang, DVC Workflow, Cache and Storage" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>DVC workflow, cache, and storage
(<a href="https://medium.com/hub-by-littlebigcode/mlops-how-dvc-smartly-manages-your-data-sets-for-training-your-machine-learning-models-on-top-of-b73857e54e52" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h3 id="dror-atariah-getting-to-know-mlem" style="position:relative;"><strong>Dror Atariah</strong>: Getting to Know MLEM<a href="#dror-atariah-getting-to-know-mlem" aria-label="dror atariah getting to know mlem permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 100px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/299a2342f2c5966eba1ce37590234270/39600/awesome.png" alt="Awesome MLEM" title="Getting to Know MLEM" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<a href="https://www.linkedin.com/in/atariah/" target="_blank" rel="nofollow noopener noreferrer"><strong>Dror Atariah</strong></a> is the first Community
member to write about MLEM! 🎉 In
<a href="http://drorata.github.io/posts/2022/Jun/17/getting-to-know-mlem/" target="_blank" rel="nofollow noopener noreferrer">his piece</a> he
gives a review of the tool and starts with a general overview. Giving it a try
with the iris dataset, he ultimately builds a Docker image with MLEM to get
predictions from a trained model served by MLEM in an API. You can try out his
project <a href="https://github.com/drorata/mlem-review" target="_blank" rel="nofollow noopener noreferrer">in this repo!</a></p>
<h3 id="-new-docs" style="position:relative;">✍🏼 New Docs<a href="#-new-docs" aria-label=" new docs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>As you can imagine, with new tools come new docs! The docs and product teams
have been furiously busy making sure that you have the docs you need to try our
new tools. Of note, please find:</p>
<ul>
<li><a href="https://mlem.ai/doc" target="_blank" rel="nofollow noopener noreferrer">MLEM Docs</a></li>
<li><a href="https://mlem.ai/doc/use-cases/model-registry" target="_blank" rel="nofollow noopener noreferrer">Machine Learning Model Registry</a>
in <a href="https://dvc.org/doc/use-cases/model-registry" target="_blank" rel="nofollow noopener noreferrer">DVC.org docs</a> as well as in
the <a href="https://mlem.ai/doc/use-cases/model-registry" target="_blank" rel="nofollow noopener noreferrer">MLEM docs</a></li>
<li><a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">VS Code docs and walkthrough</a></li>
</ul>
<h2 id="-tons-of-new-content-on-the-blog" style="position:relative;">✍🏼 Tons of new content on the blog<a href="#-tons-of-new-content-on-the-blog" aria-label=" tons of new content on the blog permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ul>
<li><a href="https://dvc.org/blog/local-experiments-to-cloud-with-tpi-docker" target="_blank" rel="nofollow noopener noreferrer">Moving Local Experiments to the Cloud with Terraform Provider Iterative (TPI) and Docker</a></li>
</ul>
<p>Have you ever struggled, or are you struggling now, with syncing data with one
of the cloud providers? We know that comes up a lot in the Discord server. So
<a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer">Milecia McGregor</a> wrote three detailed
pieces to help you out.</p>
<ul>
<li><a href="https://dvc.org/blog/aws-remotes-in-dvc" target="_blank" rel="nofollow noopener noreferrer">Syncing Data to AWS S3</a></li>
<li><a href="https://dvc.org/blog/using-gcp-remotes-in-dvc" target="_blank" rel="nofollow noopener noreferrer">Syncing Data to GCP</a></li>
<li><a href="https://dvc.org/blog/azure-remotes-in-dvc" target="_blank" rel="nofollow noopener noreferrer">Syncing Data to Azure Blob Storage</a><br>
Whatever
your flavor, she's got you covered. Look out for short videos covering the
same topics this quarter.</li>
</ul>
<p>Find more of your Discord questions answered in the latest editions of Community
Gems. 💎</p>
<ul>
<li><a href="https://dvc.org/blog/may-22-community-gems" target="_blank" rel="nofollow noopener noreferrer">May Community Gems</a></li>
<li><a href="https://dvc.org/blog/june-22-community-gems" target="_blank" rel="nofollow noopener noreferrer">June Community Gems</a></li>
</ul>
<h2 id="-online-course-updates" style="position:relative;">🧑🏽💻 Online Course Updates<a href="#-online-course-updates" aria-label=" online course updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We have surpassed 1300 students in our
<a href="https://learn.dvc.org" target="_blank" rel="nofollow noopener noreferrer">Iterative Tools School!</a> 🎉 We now have in place:</p>
<ul>
<li>Closed captions</li>
<li>Course guides for each lesson. For some of these, you will find the video
embedded into the lesson itself, but for the lessons that include code
snippets, the guides are in PDF form so that you can copy and paste them to
your heart's content! 😉</li>
</ul>
<p>If you are already in the course, or follow us on social media, you may have noticed
<a href="https://twitter.com/SoyGema" target="_blank" rel="nofollow noopener noreferrer">Gema Perreño Piqueras'</a> amazing notes on the
modules (see below). 🚨Spoiler alert: Gema's joining the DevRel
team next week! So look forward to more great content from her.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2f76b185a9df69f8115c4f9765d63833/03346/gema-course-notes.jpg" alt="Gema Perreño Piqueras' Course Notes" title="Gema Perreño Piqueras' Course Notes" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Gema Perreño Piqueras' Course Notes
(<a href="https://twitter.com/SoyGema/status/1543210842749079554?s=20&t=DMCw3cN8rFbwlD1hD_rotA" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="upcoming-events" style="position:relative;">Upcoming Events<a href="#upcoming-events" aria-label="upcoming events permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We'll be at <a href="https://ai4.io/" target="_blank" rel="nofollow noopener noreferrer">AI4</a> from August 16-18.<br>
<a href="https://twitter.com/fullstackml?lang=en" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> will give a talk as
well as participate in a panel discussion on MLOps. If you are attending, stop
by the booth and say hi or check out one of the in-booth demos we will have on
our tools throughout the day.</p>
<p>Additional conferences we will be attending this year:</p>
<ul>
<li><a href="https://odsc.com/california/" target="_blank" rel="nofollow noopener noreferrer">ODSC West</a> - San Francisco</li>
<li><a href="https://deeplearningworld.de/" target="_blank" rel="nofollow noopener noreferrer">Deep Learning World</a> - Berlin</li>
<li><a href="https://www.re-work.co/events/mlops-summit-2022" target="_blank" rel="nofollow noopener noreferrer">MLOps Summit - Re-work</a> -
London</li>
<li><a href="https://www.torontomachinelearning.com/" target="_blank" rel="nofollow noopener noreferrer">Toronto Machine Learning Summit</a> -
Toronto</li>
</ul>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a>
to find details of all the open positions. This month we are especially seeking
a fit for the Senior Software Engineer (Dataset Label Management, Python) role,
so if that fits you or someone else you know, get applying! 🚀</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/a25646ea1418d7a63d5bc4e68079fba9/03346/hiring.jpg" alt="Iterative.ai is Hiring" title="Iterative.ai is Hiring" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Iterative is Hiring
(<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Because I missed a month, there's just going to have to be two…</p>
<p>We were excited to see this project come up from
<a href="https://twitter.com/algo_diver" target="_blank" rel="nofollow noopener noreferrer">Chansung</a> using DVC, Iterative Studio,
Huggingface and Jarvis Labs AI.<br>
Looking forward to seeing how it develops! 🍿</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Redrew for easier understanding of <a href="https://twitter.com/hashtag/git_mlops?src=hash&ref_src=twsrc%5Etfw">#git_mlops</a> projects with <a href="https://twitter.com/DVCorg">@DVCorg</a> and <a href="https://twitter.com/jarvislabsai">@jarvislabsai</a>. The code needs to be cleaned, but it now deploys any model from any branches to <a href="https://twitter.com/huggingface">@huggingface</a> model and space repository.<br><br>Basically <a href="https://twitter.com/DVCorg">@DVCorg</a> is heavily used, but I just put the one logo in it. <a href="https://t.co/Cj7Z7KPaOy">pic.twitter.com/Cj7Z7KPaOy</a></p>— chansung (@algo_diver) <a href="https://twitter.com/algo_diver/status/1530455733837647873">May 28, 2022</a></blockquote>
<p>And we have this great Tweet thread from
<a href="https://twitter.com/LeonMenkreo" target="_blank" rel="nofollow noopener noreferrer">Leon Menkreo</a> about how he's taken back
control of his data, models, and predictions with DVC!</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">I took back control of my data, models, and predictions with<br><br>Data Versioning 🔀<br><br>Everything you need to get started with DVC by <a href="https://twitter.com/DVCorg">@DVCorg</a> in one mega 🧵:<br> <br>⁉️ What is DVC?<br>🔀 DVC & Model Versioning<br>🐍 DVC in python<br>📚 Resources</p>— Leon Menkreo (@LeonMenkreo) <a href="https://twitter.com/LeonMenkreo/status/1545410381677531136">July 8, 2022</a></blockquote>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/using-gcp-remotes-in-dvchttps://dvc.org/blog/using-gcp-remotes-in-dvcWed, 06 Jul 2022 00:00:00 GMT<p>When you’re working on a data science project that has huge datasets, it’s
common to store them in cloud storage. You’ll also be working with different
versions of the same datasets to train a model, so it’s crucial to have a tool
that enables you to do this quickly and easily. That’s why we’re going to do a
quick walkthrough of how to set up a remote in a GCP storage bucket and handle
data versioning with <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">DVC</a>.</p>
<p>We’ll start by creating a new storage bucket in our GCP account, then we’ll show
how you can add DVC to your project, and finally, we’ll make updates to the
dataset with DVC commands. We’ll be working with
<a href="https://github.com/iterative/stale-model-example" target="_blank" rel="nofollow noopener noreferrer">this repo</a> if you want an
example to play with. By the time you finish, you should be able to create this
setup for any machine learning project using a GCP remote.</p>
<h2 id="set-up-a-gcp-storage-bucket" style="position:relative;">Set up a GCP storage bucket<a href="#set-up-a-gcp-storage-bucket" aria-label="set up a gcp storage bucket permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Make sure that you already have a
<a href="https://console.cloud.google.com" target="_blank" rel="nofollow noopener noreferrer">GCP account</a>. You’ll need a valid credit card
to create a new account. Once you’re logged into your account, you should see a
screen like this with some of the services GCP offers.</p>
<p><em>Note:</em> Remember, GCP does have a
<a href="https://cloud.google.com/free/docs/gcp-free-tier" target="_blank" rel="nofollow noopener noreferrer">free tier</a> if you just want
to try it out.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f9291575b358c6d1e93d3902bd8f8df6/39600/gcp_initial_page.png" alt="GCP initial page" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>From here, you'll need to create a new project. Search for "create a project"
and click the "IAM & Admin" option. Enter the name of the project, which
is <code>Bicycle Project</code> in this example, choose the organization and location, and click the
<code>Create</code> button. This will take you to your project dashboard and show you all
of the stats and settings you have available.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/31c5ad53787a867d969e87e822a0832d/39600/gcp_new_project.png" alt="create a new GCP project" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Then go to <code>Cloud Storage</code> in the left sidebar to create a bucket to
store the data. When you get to the Cloud Storage page, you should see something
similar to this. Click the <code>Create Bucket</code> button.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/990bd68c194aa8588a25cb16e7cee4ac/39600/create_gcp_bucket.png" alt="create_gcp_bucket.png" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>The Bucket page will have a lot of configurations you can set, but you can leave
the settings in the default state if there’s nothing you need to customize. We
have named this example bucket <code>updatedbikedata</code> as you can see below.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/95a0fcf3369dc54bc0ae23f1d28eb937/39600/gcp_bucket_options.png" alt="gcp_bucket_options.png" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Now save your changes. You’ll be redirected to the <code>Bucket Details</code>
page, where you’ll see the bucket you just created.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/594ddcf13327666ccddcca1d3f0ec75a/39600/created_gcp_bucket.png" alt="created_gcp_bucket.png" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h3 id="get-your-credentials" style="position:relative;">Get your credentials<a href="#get-your-credentials" aria-label="get your credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Now that the bucket is created, you need to get the credentials to connect the
GCP remote to the project. Go to the <code>IAM & Admin</code> service and select
<code>Service Accounts</code> in the left sidebar.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/696a2d32a8294f63e608c3a2823ef65d/39600/gcp_empty_service_account.png" alt="no service accounts" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Click the <code>Create Service Account</code> button to create a new service account that
you'll use to connect to the DVC project in a bit. Now you can add the name and
ID for this service account and keep all the default settings. We've chosen
<code>bicycle-service-account</code> for the name and <code>bicycle-account</code> for the ID. Click
<code>Create and Continue</code> and it will show the permissions settings. Select <code>Owner</code>
in the dropdown and click <code>Continue</code>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ffa1f3370523aa3068032c6db3e7782f/39600/gcp_service_account_permissions.png" alt="service account permissions" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Then add your user to have access to the service account and click <code>Done</code>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/867a743f3bf910458d9c473f62a9906d/39600/gcp_service_account_user_access.png" alt="service account user access" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Finally, you'll be redirected to the <code>Service accounts</code> page.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/acd81a329b5b3c20d943c378603d9ea4/39600/gcp_create_service_account.png" alt="service account with name and ID" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>You’ll see your service account and you’ll be able to click on <code>Actions</code> and go
to where you <code>Manage keys</code> for this service account.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/989cff34794f4244822a454c66b7dbd0/39600/gcp_service_account.png" alt="manage keys on service account" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Once you’ve been redirected, click the <code>Add Key</code> button and this will bring up
the credentials you need to authenticate your GCP account with your project. Go
ahead and download the credentials file and store it somewhere safe.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9625bb956a5e8bd2c69f71d87df4a863/39600/gcp_key.png" alt="gcp_key.png" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
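<p>Because the downloaded file contains a private key, it is worth tightening its
permissions before going further. The following is a minimal shell sketch, not
part of the original walkthrough: the function name <code>secure_key_file</code> is
hypothetical, and the check assumes the key was downloaded in JSON format, where
service account keys carry a <code>"type": "service_account"</code> field.</p>

```shell
# Hypothetical helper: check that a downloaded file looks like a service
# account key, then restrict it to its owner.
secure_key_file() {
    key=$1
    if [ ! -f "$key" ]; then
        echo "no such key file: $key" >&2
        return 1
    fi
    # Service account key files are JSON with "type": "service_account".
    if ! grep -q '"type"[[:space:]]*:[[:space:]]*"service_account"' "$key"; then
        echo "not a service account key: $key" >&2
        return 1
    fi
    # Owner read/write only, since the file holds a private key.
    chmod 600 "$key"
}
```

<p>You would call it as <code>secure_key_file path/to/your-key.json</code>, using the
path of the file you just downloaded.</p>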
<p>That’s it for setting up your storage bucket and getting the credentials you
need! Now let’s add DVC to our demo repo and set up the remote.</p>
<h2 id="set-up-a-dvc-project" style="position:relative;">Set up a DVC project<a href="#set-up-a-dvc-project" aria-label="set up a dvc project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>First, add DVC as a requirement to your project with the following installation
command:</p>
<p><code>$ pip install 'dvc[gs]'</code></p>
<p>Then you can initialize DVC in your own project with the following command:</p>
<p><a href="https://dvc.org/doc/command-reference/init"><code>$ dvc init</code></a></p>
<p>This will add all of the DVC internals needed to start versioning your data and
tracking experiments. Now we need to set up the remote to connect our project
data stored in GCP to the DVC repo.</p>
<h3 id="create-a-default-remote" style="position:relative;">Create a default remote<a href="#create-a-default-remote" aria-label="create a default remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Now we can make the GCP storage the default remote for the project with the
following command:</p>
<p><a href="https://dvc.org/doc/command-reference/remote/add#-d"><code>$ dvc remote add -d bikes gs://updatedbikedata</code></a></p>
<p>This creates a default remote called <code>bikes</code> that connects to the
<code>updatedbikedata</code> bucket we made earlier, which is where any data for the
model will be stored.</p>
<h3 id="add-gcp-credentials" style="position:relative;">Add GCP credentials<a href="#add-gcp-credentials" aria-label="add gcp credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In order for DVC to be able to push and pull data from the remote, you need to
have valid GCP credentials.</p>
<p>If you are using the
<a href="https://cloud.google.com/sdk/docs/install-sdk" target="_blank" rel="nofollow noopener noreferrer">GCP CLI (google-cloud-sdk)</a>
already, you should be able to run <code>gcloud auth application-default login</code>. This
method doesn't require a service account.</p>
<p>You can also authenticate with the service account we created earlier, using
the credentials file we downloaded. There are a few ways to do this.</p>
<p>If you have the GCP CLI installed, you can run the following command with your service account email.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">gcloud</span> auth activate-service-account bicycle-service-account@tonal-history-154018.iam.gserviceaccount.com <span class="token parameter variable">--key-file</span><span class="token operator">=</span><span class="token punctuation">..</span>/tonal-history-154018-e62a79baf90f.json</span></code></pre></div>
<p>If you don't have the GCP CLI installed and you want to use the service account,
you can set the <code>GOOGLE_APPLICATION_CREDENTIALS</code> environment variable to point
to the credentials file, like this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">export</span> <span class="token assign-left variable">GOOGLE_APPLICATION_CREDENTIALS</span><span class="token operator">=</span><span class="token string">'../tonal-history-154018-e62a79baf90f.json'</span></span></code></pre></div>
<p>Or you can add the credentials file location with the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> <span class="token parameter variable">--local</span> bikes credentialpath <span class="token string">'../tonal-history-154018-e62a79baf90f.json'</span></span></code></pre></div>
<p>You can check out more about authentication
<a href="https://cloud.google.com/sdk/docs/authorizing" target="_blank" rel="nofollow noopener noreferrer">here in the GCP docs</a>.</p>
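<p>If a push or pull fails with an authentication error, it can help to confirm
that the credentials file is where you think it is before digging any deeper.
Here's a minimal sketch using only the Python standard library. The function
names are made up for this example, and it only checks for a few fields that
service account key files normally contain; it doesn't contact GCP at all.</p>

```python
import json
import os
from pathlib import Path

# Fields a service account key file normally contains (not an exhaustive list).
REQUIRED_KEYS = {"type", "client_email", "private_key"}


def check_service_account_file(path):
    """Return a list of problems found in a key file (empty if it looks fine)."""
    key_path = Path(path)
    if not key_path.is_file():
        return [f"{key_path} does not exist or is not a file"]
    try:
        data = json.loads(key_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{key_path} is not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return [f"{key_path} does not contain a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    if data.get("type") != "service_account":
        problems.append("'type' is not 'service_account'")
    return problems


def check_env(environ=os.environ):
    """Check the file that GOOGLE_APPLICATION_CREDENTIALS points to."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return ["GOOGLE_APPLICATION_CREDENTIALS is not set"]
    return check_service_account_file(path)
```

<p>Calling <code>check_env()</code> returns an empty list when the variable is set and the
file looks like a service account key. It says nothing about whether the account
actually has access to the bucket.</p>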
<h3 id="push-and-pull-data-with-dvc" style="position:relative;">Push and pull data with DVC<a href="#push-and-pull-data-with-dvc" aria-label="push and pull data with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Now you can push data from your local machine to the GCP remote! First, add the
data you want DVC to track with the following command:</p>
<p><a href="https://dvc.org/doc/command-reference/add"><code>$ dvc add data</code></a></p>
<p>This will allow DVC to track the entire <code>data</code> directory so it will note when
any changes are made. Then you can push that data to your GCP remote with this
command:</p>
<p><a href="https://dvc.org/doc/command-reference/push"><code>$ dvc push</code></a></p>
<p>Here's what that data will look like when it has been successfully uploaded to
GCP.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/07a71607f24160ce95b254e6a00fc2cc/39600/data_in_gcp.png" alt="data in GCP" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Then if you move to a different machine or someone else needs to use that data,
it can be accessed by cloning or forking the project repo, setting up the remote
and running:</p>
<p><a href="https://dvc.org/doc/command-reference/pull"><code>$ dvc pull</code></a></p>
<p><em>Note:</em> Depending on the authentication method being used, some extra
steps might be required, such as making sure users actually have permission to
read from and write to the bucket.</p>
<p>That’s it! Now you can connect any DVC project to a GCP storage bucket. If you
run into any issues, make sure to check that your credentials are valid, check
if your user has MFA enabled, and check that the user has the right level of
permissions.</p>https://dvc.org/blog/june-22-community-gemshttps://dvc.org/blog/june-22-community-gemsWed, 29 Jun 2022 00:00:00 GMT<h2 id="is-there-a-shorthand-command-to-commit-changes-to-all-modified-files-in-dvc-without-manually-adding-them-all-individually" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/981498675689828362" target="_blank" rel="nofollow noopener noreferrer">Is there a shorthand command to commit changes to all modified files in DVC without manually adding them all individually?</a><a href="#is-there-a-shorthand-command-to-commit-changes-to-all-modified-files-in-dvc-without-manually-adding-them-all-individually" aria-label="is there a shorthand command to commit changes to all modified files in dvc without manually adding them all individually permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Thanks for the question @Ramnath T!</p>
<p>If you already have data tracked by DVC, the <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> command adds all the
changes to those files or directories without having to name each target. You'll
still need to remember to commit any other changes you've made to Git as well.</p>
<p>If you don't have data tracked by DVC, run <a href="https://dvc.org/doc/command-reference/add"><code>dvc add &lt;file name or folder name&gt;</code></a>
and the data will be added to your local cache and no commit is needed. This is
how we make DVC aware of any new data we want versioned.</p>
<p>When you run <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>, a file hash will be calculated, the file content will be
moved to the cache, and a <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file will be created to start tracking the
added data. If you're working with remotes using the <code>--to-remote</code> option, you
can skip the local cache entirely and move the file contents directly to your
remote storage.</p>
<h2 id="how-can-i-connect-iterative-studio-to-a-remote-repo-on-a-private-network-like-a-gitlab-server" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/981543978644172830" target="_blank" rel="nofollow noopener noreferrer">How can I connect Iterative Studio to a remote repo on a private network, like a GitLab server?</a><a href="#how-can-i-connect-iterative-studio-to-a-remote-repo-on-a-private-network-like-a-gitlab-server" aria-label="how can i connect iterative studio to a remote repo on a private network like a gitlab server permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Good question about <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a> from
@LilDataScientist!</p>
<p>This is something that our users have asked about quite a bit, so we wrote up
a whole guide about
<a href="https://dvc.org/doc/studio/user-guide/connect-custom-gitlab-server" target="_blank" rel="nofollow noopener noreferrer">custom GitLab server connections</a>.
It's a quick walkthrough of how to set up the permissions you'll need and
connect your team to Studio.</p>
<p>You can find lots of great guides and explanations about everything Studio in
the <a href="https://dvc.org/doc/studio/user-guide" target="_blank" rel="nofollow noopener noreferrer">User Guide</a> section of the docs!</p>
<h2 id="how-does-dvc-get-url-interact-with-the-cache-compared-to-dvc-import-url" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/981862313076346920" target="_blank" rel="nofollow noopener noreferrer">How does <code>dvc get-url</code> interact with the cache compared to <code>dvc import-url</code>?</a><a href="#how-does-dvc-get-url-interact-with-the-cache-compared-to-dvc-import-url" aria-label="how does dvc get url interact with the cache compared to dvc import url permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This is an awesome question from @Gema Parreno!</p>
<p>When you run <a href="https://dvc.org/doc/command-reference/get-url"><code>dvc get-url</code></a>, it downloads the file/directory to your local file
system. It's <em>not</em> tracking the downloaded data with a <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file. It's just
pulling that data from some source to your file system. If you want to download
a file or directory without needing a DVC project, you can use the <a href="https://dvc.org/doc/command-reference/get-url"><code>dvc get-url</code></a>
command.</p>
<p>On the other hand, when you run <a href="https://dvc.org/doc/command-reference/import-url"><code>dvc import-url</code></a>, the local <code>cache</code> folder
inside of <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> will be updated. This is similar to running <a href="https://dvc.org/doc/command-reference/get-url"><code>dvc get-url</code></a> and
<a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> together except that <a href="https://dvc.org/doc/command-reference/import-url"><code>dvc import-url</code></a> also saves a link to the
original file/directory location so that if it changes, you can download the
updated data.</p>
<p>There is one more option to bypass the local cache and transfer data directly to
your remote storage using <a href="https://dvc.org/doc/command-reference/import-url#--to-remote"><code>dvc import-url &lt;url&gt; --to-remote</code></a>. This doesn't
download anything to your local cache so it's another way to transfer data
between remotes.</p>
<h2 id="if-an-image-is-present-in-different-directories-in-different-projects-will-the-shared-cache-store-them-both-as-one-hash-or-will-their-different-paths-mean-the-same-image-appears-twice-in-the-cache" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/984408209387298837" target="_blank" rel="nofollow noopener noreferrer">If an image is present in different directories in different projects, will the shared cache store them both as one hash or will their different paths mean the same image appears twice in the cache?</a><a href="#if-an-image-is-present-in-different-directories-in-different-projects-will-the-shared-cache-store-them-both-as-one-hash-or-will-their-different-paths-mean-the-same-image-appears-twice-in-the-cache" aria-label="if an image is present in different directories in different projects will the shared cache store them both as one hash or will their different paths mean the same image appears twice in the cache permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Great question about the cache from @paulwrightkcl!</p>
<p>DVC will index the whole directory, but there will only be one hash per file. So
the same image will only appear once in the cache. What <em>will</em> be duplicated in
the cache is the <code>.dir</code> hash that DVC uses internally as the directory tree
representation.</p>
<p>In summary, the image file is only stored in the shared cache once unless it's
modified in one of the directories.</p>
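<p>To see why identical files end up as a single cache entry, here's a toy
content-addressed cache in plain Python. It's only a sketch of the idea: the
helper names are invented, and the two-character prefix layout is a simplified
stand-in for DVC's real cache structure. The key point is that the cache path
comes from the MD5 of the file contents, not from where the file lives.</p>

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large files don't fill memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_cache(path, cache_dir):
    """Copy a file into the cache under a name derived from its hash."""
    file_hash = file_md5(path)
    dest = Path(cache_dir) / file_hash[:2] / file_hash[2:]
    if not dest.exists():  # identical content is already cached: nothing to do
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, dest)
    return dest


def demo():
    """Cache the same bytes from two projects and list the cache entries."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        first = root / "project_a" / "images" / "cat.png"
        second = root / "project_b" / "raw" / "cat_copy.png"
        for image in (first, second):
            image.parent.mkdir(parents=True)
            image.write_bytes(b"the same image bytes")
        cache = root / "shared_cache"
        add_to_cache(first, cache)
        add_to_cache(second, cache)
        return sorted(
            str(entry.relative_to(cache))
            for entry in cache.rglob("*")
            if entry.is_file()
        )
```

<p>Here <code>demo()</code> returns a single entry even though the file was added from two
different paths. Change one byte in either copy and you'd get two entries.</p>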
<h2 id="is-it-possible-to-limit-which-columns-show-for-experiments-in-the-metrics-table" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/985448515402616842" target="_blank" rel="nofollow noopener noreferrer">Is it possible to limit which columns show for experiments in the metrics table?</a><a href="#is-it-possible-to-limit-which-columns-show-for-experiments-in-the-metrics-table" aria-label="is it possible to limit which columns show for experiments in the metrics table permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Nice question from @DylanTF!</p>
<p>You can use <a href="https://dvc.org/doc/command-reference/exp/show#--drop"><code>dvc exp show --drop</code></a> (or <code>--keep</code>) to decide what to hide (or
show). For example, if you have a table like this:</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows">  <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span>   <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Created<span class="token hide">**</span></span></span>        <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span>   <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span>   <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.seed<span class="token hide">**</span></span></span>   <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span>   <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span></span>   <span class="token bg-violet"><span class="token hide">dep:</span><span class="token bold"><span class="token hide">**</span>./clf<span class="token hide">**</span></span></span>   <span class="token bg-violet"><span class="token hide">dep:</span><span class="token bold"><span class="token hide">**</span>./data<span class="token hide">**</span></span></span>    <span class="token bg-violet"><span class="token hide">dep:</span><span class="token bold"><span class="token hide">**</span>./data/train.pkl<span class="token hide">**</span></span></span>   <span class="token bg-violet"><span class="token hide">dep:</span><span class="token bold"><span class="token hide">**</span>./src/train.py<span class="token hide">**</span></span></span>   <span class="token bg-violet"><span class="token hide">dep:</span><span class="token bold"><span class="token hide">**</span>src/evaluate.py<span class="token hide">**</span></span></span>
</span> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows">  workspace    -              -          -         20210428     300           75                -       a9bb63e   aded63c            bdc3fe9          b0ef2a1
  mlem-serve   Jun 16, 2022   0.76681    0.38867   20210428     300           75                -       a9bb63e   aded63c            bdc3fe9          b0ef2a1
</span> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div>
<p>You could clean it up with a command like this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--drop</span> <span class="token string">'Created|train.seed|./clf|./data/*|./src/train.py|src/evaluate.py'</span></span></code></pre></div>
<p>Then get a table like this:</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ─────────────────────────────────────────────────────────────────
<span class="token rows">  <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span>   <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span>   <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span>   <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span>   <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span></span>
</span> ─────────────────────────────────────────────────────────────────
<span class="token rows">  workspace    -          -         300           75
  mlem-serve   0.76681    0.38867   300           75
</span> ─────────────────────────────────────────────────────────────────</code></pre></div>
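<p>Since the <code>--drop</code> value is a regular expression, it can be handy to try a
pattern out before running the command. Here's a small sketch of the filtering
in plain Python. It assumes a column is dropped when the pattern matches at the
start of its name (<code>re.match</code>), which is consistent with the tables above but
isn't taken from DVC's source, and <code>drop_columns</code> is a made-up helper.</p>

```python
import re


def drop_columns(columns, pattern):
    """Keep only the column names that the pattern does not match."""
    regex = re.compile(pattern)
    return [name for name in columns if not regex.match(name)]


# Column names from the first table above.
COLUMNS = [
    "Experiment", "Created", "avg_prec", "roc_auc",
    "train.seed", "train.n_est", "train.min_split",
    "./clf", "./data", "./data/train.pkl", "./src/train.py", "src/evaluate.py",
]

# The same pattern passed to --drop. An unescaped "." matches any character,
# and "./data/*" means "./data" followed by zero or more slashes.
PATTERN = "Created|train.seed|./clf|./data/*|./src/train.py|src/evaluate.py"

print(drop_columns(COLUMNS, PATTERN))
```

<p>Under that assumption, the columns left over are <code>Experiment</code>, <code>avg_prec</code>,
<code>roc_auc</code>, <code>train.n_est</code>, and <code>train.min_split</code>, the same ones shown in the
cleaned-up table.</p>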
<p>Alternatively, you can run the following command to only show the columns that
have changed in the experiment run:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--only-changed</span></span></code></pre></div>
<p>This will produce a table similar to this one:</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ─────────────────────────────────────────────────────────────────────────────
<span class="token rows">  <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span>   <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Created<span class="token hide">**</span></span></span>        <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span>   <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span>   <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span>   <span class="token bg-violet"><span class="token hide">dep:</span><span class="token bold"><span class="token hide">**</span>src/train.py<span class="token hide">**</span></span></span>
</span> ─────────────────────────────────────────────────────────────────────────────
<span class="token rows">  workspace    -              -          -         325           94279e0
  mlem-serve   Jun 16, 2022   0.76681    0.38867   300           bdc3fe9
</span> ─────────────────────────────────────────────────────────────────────────────</code></pre></div>
<p>You can also look at/edit these tables with the
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC VS Code extension</a>!
If you're interested in more advanced visualizations, you should try out
<a href="https://studio.datachain.ai/#features" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a>.</p>
<h2 id="is-it-possible-to-create-commit-and-push-updates-to-datasets-using-dvc-with-python-instead-of-the-command-line" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/988895726257991740" target="_blank" rel="nofollow noopener noreferrer">Is it possible to create, commit, and push updates to datasets using DVC with Python instead of the command line?</a><a href="#is-it-possible-to-create-commit-and-push-updates-to-datasets-using-dvc-with-python-instead-of-the-command-line" aria-label="is it possible to create commit and push updates to datasets using dvc with python instead of the command line permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Fantastic question from @wlu07!</p>
<p>Yes, we do have an internal <code>Repo</code> class to do DVC operations using Python. You
can refer to the
<a href="https://github.com/iterative/dvc/tree/main/dvc/commands" target="_blank" rel="nofollow noopener noreferrer">GitHub repo for the DVC CLI commands</a>
to see how the CLI arguments are translated into the <code>Repo</code> function arguments
and you can see how to use some of the
<a href="https://dvc.org/doc/api-reference" target="_blank" rel="nofollow noopener noreferrer"><code>Repo</code> methods in our docs</a>.</p>
<p>Here's an example of how you might run DVC commands using Python:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>repo <span class="token keyword">import</span> Repo
<span class="token comment"># Open the DVC project in the current directory</span>
repo <span class="token operator">=</span> Repo<span class="token punctuation">(</span><span class="token string">"."</span><span class="token punctuation">)</span>
<span class="token comment"># Track the file, like `dvc add test_dataset.csv`</span>
repo<span class="token punctuation">.</span>add<span class="token punctuation">(</span><span class="token string">"test_dataset.csv"</span><span class="token punctuation">)</span>
<span class="token comment"># Upload the tracked data to remote storage, like `dvc push`</span>
repo<span class="token punctuation">.</span>push<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div>
<p>Keep in mind that <code>dvc.repo.Repo</code> is not an official public API, so there is no
guarantee that it will remain stable.</p>
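<p>If you'd rather not depend on an internal class, one alternative is to call
the documented CLI from Python. Here's a rough sketch with the standard
library's <code>subprocess</code> module. The helper names are invented for this example,
and it assumes the <code>dvc</code> executable is on your <code>PATH</code> and that a default remote
is already configured.</p>

```python
import subprocess
from typing import Callable, List


def build_dvc_commands(target: str) -> List[List[str]]:
    """Return the CLI calls that mirror the Repo example: add, then push."""
    return [
        ["dvc", "add", target],
        ["dvc", "push"],
    ]


def track_and_push(target: str, runner: Callable = subprocess.run) -> None:
    """Run each command in order and stop at the first failure.

    `runner` is injectable so the function can be exercised without DVC
    installed, for example in a unit test.
    """
    for argv in build_dvc_commands(target):
        runner(argv, check=True)


# Typical use, run from inside a DVC project:
# track_and_push("test_dataset.csv")
```

<p>With <code>check=True</code>, a failing command raises <code>subprocess.CalledProcessError</code>, so
a failed <code>dvc add</code> won't be followed by a push.</p>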
<h2 id="how-can-i-write-generated-artifacts-back-to-a-github-repo-after-a-github-workflow-is-finished" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/983379949023006750" target="_blank" rel="nofollow noopener noreferrer">How can I write generated artifacts back to a GitHub repo after a GitHub workflow is finished?</a><a href="#how-can-i-write-generated-artifacts-back-to-a-github-repo-after-a-github-workflow-is-finished" aria-label="how can i write generated artifacts back to a github repo after a github workflow is finished permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Wonderful CML question from @Fourtin!</p>
<p>If you want to add the artifact to your repo just like you would a file, then
you should check out the <a href="https://cml.dev/doc/ref/pr" target="_blank" rel="nofollow noopener noreferrer"><code>cml pr &lt;file&gt;</code> command</a>.
You can use this to merge pull requests to the same branch the workflow was
triggered from.</p>
<p>For example, if you run a command like:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token cml">cml pr</span> <span class="token parameter variable">--squash</span> train.py</span></code></pre></div>
<p>It will run <code>git add train.py</code>, commit the change, create a new branch, open a
pull request, and squash and merge it.</p>
<h2 id="is-there-a-way-to-programmatically-update-the-content-of-paramspy" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/987004036995764304" target="_blank" rel="nofollow noopener noreferrer">Is there a way to programmatically update the content of <code>params.py</code>?</a><a href="#is-there-a-way-to-programmatically-update-the-content-of-paramspy" aria-label="is there a way to programmatically update the content of paramspy permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Thanks for asking this @petek!</p>
<p>If you have a <code>params.py</code> file like this:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">class</span> <span class="token class-name">TrainTestSplit</span><span class="token punctuation">:</span>
    FOLDER <span class="token operator">=</span> <span class="token string">"data/train_test_split"</span>
    SPLIT_METHOD <span class="token operator">=</span> <span class="token string">"proportional"</span></code></pre></div>
<p>In DVC, you can update a parameter and run the experiment in one step with <a href="https://dvc.org/doc/command-reference/exp/run#--set-param"><code>dvc exp run --set-param &lt;param&gt;</code></a>.
Here's an example of what that might look like:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> params.py:TrainTestSplit.SPLIT_METHOD<span class="token operator">=</span><span class="token string">"proportional"</span></span></code></pre></div>
<p><em>Note:</em>
<a href="https://dvc.org/doc/command-reference/params#examples-python-parameters-file" target="_blank" rel="nofollow noopener noreferrer">DVC may not be able to update Python parameters files correctly</a>.
Because of this, we recommend you use <code>params.yaml</code> files.</p>
<p>If you need a pure Python solution, you could try something like this:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>utils<span class="token punctuation">.</span>serialize <span class="token keyword">import</span> modify_py
<span class="token keyword">with</span> modify_py<span class="token punctuation">(</span><span class="token string">"params.py"</span><span class="token punctuation">)</span> <span class="token keyword">as</span> d<span class="token punctuation">:</span>
    d<span class="token punctuation">[</span><span class="token string">"key"</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token string">"value"</span></code></pre></div>
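<p>To give a feel for what updating a Python parameters file involves, here's a
standard-library sketch that rewrites one class-level constant in place using
the <code>ast</code> module. This is not how DVC does it. <code>set_class_constant</code> is a
hypothetical helper that only handles simple single-line assignments, which
hints at why YAML files are the safer choice.</p>

```python
import ast


def set_class_constant(source, class_name, attr, value):
    """Return `source` with `class_name.attr` reassigned to `value`.

    Only top-level classes and single-line `NAME = literal` assignments are
    handled; anything else raises an error.
    """
    tree = ast.parse(source)
    for node in tree.body:
        if not (isinstance(node, ast.ClassDef) and node.name == class_name):
            continue
        for stmt in node.body:
            if not isinstance(stmt, ast.Assign) or len(stmt.targets) != 1:
                continue
            target = stmt.targets[0]
            if not (isinstance(target, ast.Name) and target.id == attr):
                continue
            old = stmt.value
            if old.lineno != old.end_lineno:
                raise ValueError("multi-line values are not handled here")
            lines = source.split("\n")
            # ast column offsets count UTF-8 bytes, so slice the encoded line.
            raw = lines[old.lineno - 1].encode("utf-8")
            new = (
                raw[: old.col_offset]
                + repr(value).encode("utf-8")
                + raw[old.end_col_offset :]
            )
            lines[old.lineno - 1] = new.decode("utf-8")
            return "\n".join(lines)
    raise KeyError(f"{class_name}.{attr} not found")


SOURCE = (
    "class TrainTestSplit:\n"
    '    FOLDER = "data/train_test_split"\n'
    '    SPLIT_METHOD = "proportional"\n'
)

updated = set_class_constant(SOURCE, "TrainTestSplit", "SPLIT_METHOD", "random")
print(updated)
```

<p>Everything except the replaced value is left untouched, comments and
formatting included. The new value is written with <code>repr()</code>, so a string comes
out in single quotes.</p>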
<hr>
<p><img src="https://media.giphy.com/media/pdSncNyYgaH0wqaCqp/giphy.gif" alt="Duck Dynasty GIF by DefyTV"></p>
<p>Keep an eye out for our next Office Hours Meetup!
<a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/" target="_blank" rel="nofollow noopener noreferrer">Join our group</a>
to stay up to date with specifics as we get closer to the event!</p>
<p>Check out <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">our docs</a> to get all your DVC and CML questions
answered!</p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to chat with the
community!</p>https://dvc.org/blog/DVC-VS-Code-extensionhttps://dvc.org/blog/DVC-VS-Code-extensionTue, 14 Jun 2022 00:00:00 GMT<p>Since its beta release in 2017, DVC has become an essential tool for many data
science teams. Its data versioning capabilities, reproducible pipelines, and
experiment tracking features are at the core of our ecosystem of open MLOps
tools.</p>
<p>Today we are proud to launch a new product that extends how machine learning
teams can use DVC:
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">our extension for Visual Studio Code</a>.</p>
<p>With this extension, you get a full VS Code-native experimentation platform for
your machine learning projects. Control your datasets and models, run
experiments, view metrics, create plots, and much more. You can do this all from
the comfort of your IDE, without the need for external services or logins. The
only thing you need is a
<a href="https://dvc.org/doc/start/data-management/data-pipelines#get-started-data-pipelines" target="_blank" rel="nofollow noopener noreferrer">DVC pipeline</a>.</p>
<p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-06-14/overview-9b53e8f5328a63e7590c574ffcd46f12.mp4" type="video/mp4"> Your
browser does not support the video tag. </video></p>
<section class="elp-content-holder">
<a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Download the DVC extension</h4>
<div class="elp-description">Install the DVC extension from the VS Code marketplace to get started. Manage your data, run experiments,
compare metrics, and visualize plots, all from the comfort of your IDE.</div>
<div class="elp-link">marketplace.visualstudio.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2022-06-14/vscode-logo-9aa0983c47274b4c145190fb005e1bdd.png" alt="Download the DVC extension">
</div>
</a>
</section>
<h1 id="why-a-vs-code-extension" style="position:relative;">Why a VS Code extension?<a href="#why-a-vs-code-extension" aria-label="why a vs code extension permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>We built DVC to expand upon the Git workflow to
<a href="https://dvc.org/blog/ml-experiment-versioning" target="_blank" rel="nofollow noopener noreferrer">make it well-suited for ML experimentation</a>.
This approach brought us independence from the infrastructure and provided a
natural connection to best practices from software engineering. However, a pure
CLI tool can only take things so far when it comes to visualizing experiments or
displaying large tables.</p>
<p><a href="https://insights.stackoverflow.com/survey/2021#section-most-popular-technologies-integrated-development-environment" target="_blank" rel="nofollow noopener noreferrer">VS Code is the IDE of choice for many</a>
and was a natural choice for a platform to add a graphical interface to DVC.</p>
<p>With this extension, we want to:</p>
<ul>
<li>Move the ML experimentation workflow into your IDE</li>
<li>Provide interactive plots and tables for analyzing ML experiments</li>
<li>Make DVC more accessible by providing an alternative to the complexity of the
CLI</li>
</ul>
<p>For us as data scientists, DVC is our toolbox. This extension turns VS Code into our
workshop.</p>
<h1 id="features" style="position:relative;">Features<a href="#features" aria-label="features permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Our extension introduces the DVC view, your one-stop shop for everything related
to your ML experiments. You can run new experiments from here, manage
parameters, and compare metrics and plots for different models.</p>
<p>The extension also adds panes to the
<a href="https://code.visualstudio.com/docs/getstarted/userinterface#_explorer" target="_blank" rel="nofollow noopener noreferrer"><em>Explorer</em></a>
and <a href="https://code.visualstudio.com/Docs/editor/versioncontrol" target="_blank" rel="nofollow noopener noreferrer"><em>Source Control</em></a>
views for managing all datasets and models in your DVC repository.</p>
<p><a href="https://youtu.be/LHi3SWGD9nc" target="_blank" rel="nofollow noopener noreferrer"><em>Check out the feature video on YouTube!</em></a></p>
<h2 id="experiment-bookkeeping" style="position:relative;">Experiment bookkeeping<a href="#experiment-bookkeeping" aria-label="experiment bookkeeping permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Quickly run new experiments and compare their resulting metrics in the
experiments table. Use the command palette or buttons to reproduce old
experiments, run new ones, or add them to the queue for later.</p>
<p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-06-14/experiment-bookkeeping-b616e98446515f0510b5fb97df5cd613.mp4" type="video/mp4"> Your
browser does not support the video tag. </video></p>
<h2 id="interactive-plots" style="position:relative;">Interactive plots<a href="#interactive-plots" aria-label="interactive plots permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Select experiments to compare and visualize their performance in interactive
plots. You can export these plots to PNG or SVG for use elsewhere.</p>
<p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-06-14/compare-experiments-a5306a13eff94b8e30d9d58e12b2a443.mp4" type="video/mp4"> Your
browser does not support the video tag. </video></p>
<h2 id="live-tracking" style="position:relative;">Live tracking<a href="#live-tracking" aria-label="live tracking permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Get insight into the training process of your models with live tracking of
metrics. As soon as your metrics change, your plots will be updated
automatically.</p>
<p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-06-14/live-metrics-d12f70f91085124fd74a4af4ea8f1f16.mp4" type="video/mp4"> Your
browser does not support the video tag. </video></p>
<h2 id="reproducibility" style="position:relative;">Reproducibility<a href="#reproducibility" aria-label="reproducibility permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Click <em>Apply to workspace</em> to reproduce any past experiment. DVC will restore
all artifacts for that experiment, and you can rerun it or use it as a base for
a new experiment.</p>
<p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-06-14/apply-to-workspace-923ba22dd0a7a6ef62cc0145ee2fc831.mp4" type="video/mp4"> Your
browser does not support the video tag. </video></p>
<h2 id="data-management" style="position:relative;">Data management<a href="#data-management" aria-label="data management permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Use the DVC tracked panel in the
<a href="https://code.visualstudio.com/docs/getstarted/userinterface#_explorer" target="_blank" rel="nofollow noopener noreferrer"><em>Explorer</em></a>
view to quickly navigate the files in the DVC project(s) in your workspace.</p>
<p>The <a href="https://code.visualstudio.com/Docs/editor/versioncontrol" target="_blank" rel="nofollow noopener noreferrer"><em>Source Control</em></a>
view now lets you manage datasets and models tracked by DVC without using the
terminal. The DVC panel shows you the state of the workspace. From here, you can
track artifacts and synchronize versions with your remote repository. Just like
you use Git to track changes to your code!</p>
<p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-06-14/data-management-d768171bc3ae20848014004d6bee36e0.mp4" type="video/mp4"> Your
browser does not support the video tag. </video></p>
<hr>
<h1 id="whats-next" style="position:relative;">What's next?<a href="#whats-next" aria-label="whats next permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>From here on out, we plan on making the extension even better with new features
such as pipeline (DAG) support,
<a href="https://github.com/iterative/terraform-provider-iterative" target="_blank" rel="nofollow noopener noreferrer">TPI</a> integration for
remote execution, autocomplete for <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, and parallel coordinate plots.</p>
<p>Of course, we would love to hear what you look forward to most. Please give us
feedback on what you would like to see next!</p>
<p><img src="https://media.giphy.com/media/cEYFeE4wJ6jdDVBiiIM/giphy.gif" alt="Space Cowboy GIF"></p>
<h1 id="thank-you-️" style="position:relative;">Thank you! ❤️<a href="#thank-you-%EF%B8%8F" aria-label="thank you ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>We would sincerely like to thank everyone who has helped make this project
possible:</p>
<ul>
<li><a href="https://github.com/hediet" target="_blank" rel="nofollow noopener noreferrer">Henning Dieterichs</a>, for helping us get started</li>
<li><a href="https://twitter.com/DynamicWebPaige" target="_blank" rel="nofollow noopener noreferrer">Paige Bailey</a>, for her support and warm
tweets</li>
<li><a href="https://www.linkedin.com/in/siddhanthunnithan/" target="_blank" rel="nofollow noopener noreferrer">Sid Unnithan</a>, for his review
and help in getting the word out there</li>
<li><a href="https://vscode-dev-community.slack.com/join/shared_invite/zt-zq9w7ddw-VD1NVQ4p2XLT7vh_kO7bJA#/shared-invite/email" target="_blank" rel="nofollow noopener noreferrer">The VS Code developer community</a></li>
<li>Everyone who has beta-tested the extension and provided their feedback!</li>
</ul>
<h1 id="resources" style="position:relative;">Resources<a href="#resources" aria-label="resources permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Want to read more about DVC or the extension? Check out the following pages:</p>
<ul>
<li><a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC extension on the VS Code marketplace</a></li>
<li><a href="https://github.com/iterative/vscode-dvc" target="_blank" rel="nofollow noopener noreferrer">GitHub repository</a></li>
<li><a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC docs</a></li>
<li><a href="https://dvc.org/blog/ml-experiment-versioning" target="_blank" rel="nofollow noopener noreferrer">Dave Berenbaum's post on DVC's experiment versioning</a></li>
<li><a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines" target="_blank" rel="nofollow noopener noreferrer">Alex Kim's guide on setting up an ML pipeline</a></li>
<li><a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Iterative community on Discord</a></li>
</ul>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/LHi3SWGD9nc?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>https://dvc.org/blog/azure-remotes-in-dvchttps://dvc.org/blog/azure-remotes-in-dvcMon, 13 Jun 2022 00:00:00 GMT<p>When you’re working on a data science project that has huge datasets, it’s
common to store them in cloud storage. You’ll also be working with different
versions of the same datasets to train a model, so it’s crucial to have a tool
that enables you to switch between datasets quickly and easily. That’s why we’re
going to do a quick walkthrough of how to set up a remote with Azure Blob
Storage and handle data versioning with <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">DVC</a>.</p>
<p>We’ll start by creating a new blob storage container in our Azure account, then
we’ll show how you can add DVC to your project. We’ll be working with
<a href="https://github.com/iterative/stale-model-example" target="_blank" rel="nofollow noopener noreferrer">this repo</a> if you want an
example to play with.</p>
<admon type="info">
<p>By the time you finish, you should be able to create this setup for any machine
learning project using an Azure remote.</p>
</admon>
<h2 id="set-up-an-azure-blob-storage-container" style="position:relative;">Set up an Azure blob storage container<a href="#set-up-an-azure-blob-storage-container" aria-label="set up an azure blob storage container permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Make sure that you already have a
<a href="https://azure.microsoft.com/en-us/features/azure-portal/" target="_blank" rel="nofollow noopener noreferrer">Microsoft Azure account</a>.
When you log in, you should see a page like this.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/64c512d8a94c04c1857ad929b0111f22/39600/initial_azure.png" alt="initial Azure page" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Search for <code>storage accounts</code> in the search bar and click <code>Storage accounts</code>
under <code>Services</code>. Make sure you don't click the "classic" option.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d36abba60a0f4782c01e831f9d414c16/39600/storage_account_search.png" alt="search for storage account" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>This will bring you to the <code>Storage accounts</code> page, where you'll need to click the
<code>Create storage account</code> button.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1ded7676bdc2404e371fc87c428ef229/39600/storage_account_page.png" alt="storage accounts page" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Now you need to enter a <code>Resource group</code> and name for the account. You can
create a new resource group right here, like we do, and call it
<code>BicycleProject</code>. We'll name this storage account <code>bicycleproject</code>. Then you can
leave all the default settings in place and click <code>Review + create</code>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/681ef7efd9fb3b770ca862910073965d/39600/storage_account_details.png" alt="storage account details" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Azure will run validation on the account and then you'll be able to click
<code>Create</code> and it will generate the storage account.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/56870dafd3e809dfdbdc5da58b98b4a6/39600/created_storage_account.png" alt="created storage account" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>You'll get redirected to a new page and you should click the <code>Go to resource</code>
button. Now you should see all of the details for your storage account. In the
left sidebar, go to <code>Data storage</code> > <code>Containers</code>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/336f02f3dc5208eb4fd6aa657bb83dba/39600/bicycle_project_account.png" alt="bicycle project account" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Then click the <code>+ Container</code> button at the top of the new page and you'll see a
right sidebar open. In the name field, type <code>bikedata</code> and then click <code>Create</code>.
Now we have everything set up for the blob storage to work.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/469f2358acd6ad6849cef0114f3a80a2/39600/bikedata_container.png" alt="new container for bike data" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h3 id="set-the-right-roles-for-your-azure-account" style="position:relative;">Set the right roles for your Azure account<a href="#set-the-right-roles-for-your-azure-account" aria-label="set the right roles for your azure account permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You'll need the right roles on your storage account and your container in order
to connect this remote storage to your machine learning project.</p>
<p>On the page for your <code>bicycleproject</code> storage account, go to
<code>Access Control (IAM)</code> in the left sidebar.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7b079b25579752299c2692d3559a49bb/39600/storage_account_iam.png" alt="update roles for storage account" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>On this page, you'll click <code>Add role assignment</code> and get directed to the page
with all of the roles.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f251e28b6fd0fd87465cc467e7febacd/39600/storage_account_role.png" alt="update roles for storage account" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Select the <code>Storage Blob Data Contributor</code> role and click <code>Next</code>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/a6c1bb759192b9b66b5461dffba75ad1/39600/storage_account_member.png" alt="update roles for storage account" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Then you can click <code>+ Select members</code> to add this role to your user.</p>
<p>You'll also need to go through this exact flow for your <code>bikedata</code> container, so
make sure you do this immediately after your storage account is updated.</p>
<p>Since our Azure storage account and container have the correct roles now, let's
set up the project!</p>
<h2 id="set-up-a-dvc-project" style="position:relative;">Set up a DVC project<a href="#set-up-a-dvc-project" aria-label="set up a dvc project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>First, add DVC as a requirement to your project with the following installation
command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> <span class="token string">'dvc[azure]'</span></span></code></pre></div>
<p>Then you can initialize DVC in your own project with the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc init</span></span></code></pre></div>
<p>This will add all of the DVC internals needed to start versioning your data and
tracking experiments. Now we need to set up the remote to connect our project
data stored in Azure to the DVC repo.</p>
<h3 id="create-a-default-remote" style="position:relative;">Create a default remote<a href="#create-a-default-remote" aria-label="create a default remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Now we can add a default remote to the project with the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">-d</span> bikes azure://bikedata</span></code></pre></div>
<p>This creates a default remote called <code>bikes</code> that connects to the <code>bikedata</code>
container we made earlier, which is where the training data for the model will be
stored.</p>
<h3 id="add-azure-credentials" style="position:relative;">Add Azure credentials<a href="#add-azure-credentials" aria-label="add azure credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In order for DVC to be able to push and pull data from the remote, you need to
have valid Azure credentials.</p>
<p>By default, DVC authenticates using your
<a href="https://docs.microsoft.com/en-us/cli/azure/install-azure-cli" target="_blank" rel="nofollow noopener noreferrer">Azure CLI</a>
configuration.</p>
<p>Run the following command to authenticate with Azure.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">az</span> login
</span>A web browser has been opened at https://login.microsoftonline.com/organizations/oauth2/v2.0/authorize. Please continue the login in the web browser. If no web browser is available or if the web browser fails to open, use device code flow with `az login --use-device-code`.
[
  {
    "cloudName": "AzureCloud",
    "homeTenantId": "some-id",
    "id": "some-id",
    "isDefault": true,
    "managedByTenants": [],
    "name": "Azure subscription 1",
    "state": "Enabled",
    "tenantId": "some-id",
    "user": {
      "name": "[email protected]",
      "type": "user"
    }
  }
]</code></pre></div>
<p>This should open a browser window like the one below, where you can enter your login
credentials.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e75c1384fe653dd3dd95bae6893dfc5d/39600/azure_auth_page.png" alt="Azure CLI authentication page" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>You can check out more details on this command
<a href="https://docs.microsoft.com/en-us/cli/azure/authenticate-azure-cli" target="_blank" rel="nofollow noopener noreferrer">here in the Azure docs</a>.
If you want to use a different authentication method with DVC, check out
<a href="https://dvc.org/doc/command-reference/remote/modify#microsoft-azure-blob-storage" target="_blank" rel="nofollow noopener noreferrer">our docs here</a>.</p>
<p>You will also need to manually define the storage account name with the
following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> bikes account_name <span class="token string">'bicycleproject'</span></span></code></pre></div>
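<p>Why set the account name separately? The remote URL <code>azure://bikedata</code> only names
the container, while Azure serves blobs from an endpoint that also includes the storage
account. The sketch below illustrates that mapping. It assumes the standard public-cloud
endpoint pattern, and the helper <code>blob_endpoint</code> is a hypothetical name for this
example, not part of DVC.</p>

```python
from urllib.parse import urlsplit


def blob_endpoint(remote_url: str, account_name: str) -> str:
    """Map an azure:// remote URL to the HTTPS endpoint Azure serves it from.

    Hypothetical helper for illustration only; DVC resolves this internally.
    """
    parts = urlsplit(remote_url)
    if parts.scheme != "azure" or not parts.netloc:
        raise ValueError(f"expected azure://CONTAINER[/path], got {remote_url!r}")
    container = parts.netloc        # "bikedata" in this walkthrough
    prefix = parts.path.strip("/")  # optional folder inside the container
    base = f"https://{account_name}.blob.core.windows.net/{container}"
    return f"{base}/{prefix}" if prefix else base


print(blob_endpoint("azure://bikedata", "bicycleproject"))
# prints https://bicycleproject.blob.core.windows.net/bikedata
```

<p>Without <code>account_name</code>, the first part of that address is unknown, which is why
the <code>dvc remote modify</code> step is needed on top of <code>az login</code>.</p>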
<h3 id="push-and-pull-data-with-dvc" style="position:relative;">Push and pull data with DVC<a href="#push-and-pull-data-with-dvc" aria-label="push and pull data with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Now you can push data from your local machine to the Azure remote! First, add
the data you want DVC to track with the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc add</span> data</span></code></pre></div>
<p>This will allow DVC to track the entire <code>data</code> directory so it will note when
any changes are made. Then you can push that data to your Azure remote with this
command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span></span></code></pre></div>
<p>Here's what the data might look like in your Azure container.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7cbb06d18650df41a08fe6a8c35acb3d/39600/data_in_azure.png" alt="data in Azure container" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
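<p>To get a feel for how tracking a directory lets DVC notice changes, here is a simplified
sketch of hash-based change detection. DVC records MD5 hashes of tracked files, but
this is not its actual implementation: the functions <code>snapshot</code> and
<code>changed_files</code> and the sample CSV files are made up for this example.</p>

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(directory: Path) -> dict[str, str]:
    """Return {relative path: MD5 hex digest} for every file under directory."""
    return {
        path.relative_to(directory).as_posix(): hashlib.md5(path.read_bytes()).hexdigest()
        for path in sorted(directory.rglob("*"))
        if path.is_file()
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """List paths that were added, removed, or whose contents differ."""
    return sorted(p for p in before.keys() | after.keys() if before.get(p) != after.get(p))


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    (data / "bikes.csv").write_text("id,label\n1,road\n")
    first = snapshot(data)
    (data / "bikes.csv").write_text("id,label\n1,road\n2,gravel\n")  # edit a file
    (data / "extra.csv").write_text("id\n3\n")  # add a file
    print(changed_files(first, snapshot(data)))
    # prints ['bikes.csv', 'extra.csv']
```

<p>Because the comparison is based on content rather than timestamps, re-saving a file
with identical contents does not count as a change.</p>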
<p>Then if you move to a different machine or someone else needs to use that data,
it can be accessed by cloning or forking the project repo and running:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span></span></code></pre></div>
<p>This will get any data from your remote and download it to your local machine.</p>
<admon type="info">
<p>Authentication has to be set up locally on any machine you need to pull or push
data from. That means running the <code>az login</code> command on any other machine. You
don't need to go through the DVC setup again.</p>
</admon>
<hr>
<p>That’s it! Now you can connect any DVC project to an Azure blob storage
container. If you run into any issues, make sure to check that your credentials
are valid, check if your user has MFA enabled, and check that the user has the
right level of permissions.</p>https://dvc.org/blog/MLEM-releasehttps://dvc.org/blog/MLEM-releaseWed, 01 Jun 2022 00:00:00 GMT<p>With MLEM, ML teams get a single tool to <strong>run your models anywhere</strong> that
strives to cover all model productionization scenarios you have.</p>
<p>MLEM enables this via <strong>model metadata codification</strong>: saving all information
that is required to use a model later. Besides packaging a model for deployment
it can be used for many things, including search and documentation. To make it
even more convenient, MLEM uses human-readable YAML files for that.</p>
<p>Finally, keeping that meta-information in Git lets you create a
<strong>Git-native model registry</strong>, so you can handle model lifecycle management
in Git and get all the benefits of CI/CD. That brings your ML team one step closer
to GitOps.</p>
<p>We built MLEM to address issues that MLOps teams have around managing model
information as they move models from training and development to production and,
ultimately, retirement. The Git-based model
(<a href="https://iterative.ai/why-iterative/" target="_blank" rel="nofollow noopener noreferrer">one of our core philosophies</a>) aligns
model operations and deployment with software development teams – information
and automation are all based on familiar DevOps tools – so that deploying any
model into production is that much faster.</p>
<h1 id="model-metadata-codification" style="position:relative;">Model metadata codification<a href="#model-metadata-codification" aria-label="model metadata codification permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Capturing model-specific information requires an understanding of the
programming language and ML frameworks that models are created with. That's why MLEM is
a Python-specific tool. To provide a developer-first experience, MLEM exposes
a carefully designed CLI to help you manage the DevOps parts of the workflow, and
a Python API to handle model productionization programmatically.</p>
<p>It's easy to start using MLEM, since it integrates nicely into your existing
training workflows by adding a couple of lines:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> mlem
mlem<span class="token punctuation">.</span>api<span class="token punctuation">.</span>save<span class="token punctuation">(</span>
my_model<span class="token punctuation">,</span>
<span class="token string">"mlem-model"</span><span class="token punctuation">,</span>
sample_data<span class="token operator">=</span>train
<span class="token punctuation">)</span></code></pre></div>
<p>That produces two files: model binary and model metadata, which is a <code>.mlem</code>
file:</p>
<div class="gatsby-highlight" data-language="shell"><pre class="language-shell"><code class="language-shell">$ <span class="token function">ls</span> models
mlem-model mlem-model.mlem</code></pre></div>
<p>MLEM automatically detects everything you need to run the model: ML framework,
model dependencies (i.e. Python requirements), methods, and input/output data
schema (note that we didn't specify any of those in the <code>save</code> call above!).</p>
<p>This enables easy codification of arbitrarily complex models, such as a Python
function in which you average a couple of frameworks or a custom Python class
that uses different libraries to generate the features and make a prediction.
MLEM saves this information in a simple human-readable YAML file:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># mlem-model.mlem</span>
<span class="token key atrule">artifacts</span><span class="token punctuation">:</span>
  <span class="token key atrule">data</span><span class="token punctuation">:</span>
    <span class="token key atrule">hash</span><span class="token punctuation">:</span> b7f7e869f2b9270c516b546f09f49cf7
    <span class="token key atrule">size</span><span class="token punctuation">:</span> <span class="token number">166864</span>
    <span class="token key atrule">uri</span><span class="token punctuation">:</span> mlem<span class="token punctuation">-</span>model
<span class="token key atrule">description</span><span class="token punctuation">:</span> Random Forest Classifier
<span class="token key atrule">labels</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> random<span class="token punctuation">-</span>forest
  <span class="token punctuation">-</span> classifier
<span class="token key atrule">model_type</span><span class="token punctuation">:</span>
  <span class="token key atrule">methods</span><span class="token punctuation">:</span>
    <span class="token key atrule">predict_proba</span><span class="token punctuation">:</span>
      <span class="token key atrule">args</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> data
          <span class="token key atrule">type_</span><span class="token punctuation">:</span>
            <span class="token key atrule">columns</span><span class="token punctuation">:</span>
              <span class="token punctuation">-</span> sepal length (cm)
              <span class="token punctuation">-</span> sepal width (cm)
              <span class="token punctuation">-</span> petal length (cm)
              <span class="token punctuation">-</span> petal width (cm)
            <span class="token key atrule">dtypes</span><span class="token punctuation">:</span>
              <span class="token punctuation">-</span> float64
              <span class="token punctuation">-</span> float64
              <span class="token punctuation">-</span> float64
              <span class="token punctuation">-</span> float64
            <span class="token key atrule">index_cols</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
            <span class="token key atrule">type</span><span class="token punctuation">:</span> dataframe
      <span class="token key atrule">name</span><span class="token punctuation">:</span> predict_proba
      <span class="token key atrule">returns</span><span class="token punctuation">:</span>
        <span class="token key atrule">dtype</span><span class="token punctuation">:</span> float64
        <span class="token key atrule">shape</span><span class="token punctuation">:</span>
          <span class="token punctuation">-</span> <span class="token null important">null</span>
          <span class="token punctuation">-</span> <span class="token number">3</span>
        <span class="token key atrule">type</span><span class="token punctuation">:</span> ndarray
  <span class="token key atrule">type</span><span class="token punctuation">:</span> sklearn
<span class="token key atrule">object_type</span><span class="token punctuation">:</span> model
<span class="token key atrule">requirements</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token key atrule">module</span><span class="token punctuation">:</span> sklearn
    <span class="token key atrule">version</span><span class="token punctuation">:</span> 1.0.2
  <span class="token punctuation">-</span> <span class="token key atrule">module</span><span class="token punctuation">:</span> pandas
    <span class="token key atrule">version</span><span class="token punctuation">:</span> 1.4.1
  <span class="token punctuation">-</span> <span class="token key atrule">module</span><span class="token punctuation">:</span> numpy
    <span class="token key atrule">version</span><span class="token punctuation">:</span> 1.22.3</code></pre></div>
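<p>Because the metadata is plain YAML, other tools can consume it too. As a minimal sketch (not part of MLEM itself), the snippet below turns the <code>requirements</code> section of the example file into pip-style pins. The data is copied by hand from the example, and the helper name <code>to_pip_requirements</code> and the import-name-to-package mapping are illustrative assumptions.</p>

```python
# Hypothetical helper, not part of MLEM's API.
# REQUIREMENTS is copied by hand from the `requirements` section of the
# example mlem-model.mlem file; in practice you would load it with a YAML parser.
REQUIREMENTS = [
    {"module": "sklearn", "version": "1.0.2"},
    {"module": "pandas", "version": "1.4.1"},
    {"module": "numpy", "version": "1.22.3"},
]

# Import names do not always match PyPI package names (assumed mapping).
PYPI_NAMES = {"sklearn": "scikit-learn"}


def to_pip_requirements(requirements):
    """Return one 'package==version' pin per recorded module."""
    return [
        f"{PYPI_NAMES.get(req['module'], req['module'])}=={req['version']}"
        for req in requirements
    ]


print("\n".join(to_pip_requirements(REQUIREMENTS)))
```

<p>Keeping the mapping in a separate dictionary makes it easy to extend when an import name and its package name differ.</p>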
<p>To make ML model development Git-native, MLEM can work with DVC to manage
versions of a model stored remotely in the cloud. Committing both model
metainformation (<code>mlem-model.mlem</code>) and a pointer to the model binary
(<code>mlem-model.dvc</code> or <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a> if you train it in a DVC pipeline) to Git allows
you to enable GitFlow and other Software Engineering best practices like GitOps.</p>
<h1 id="running-your-models-anywhere" style="position:relative;">Running your models anywhere<a href="#running-your-models-anywhere" aria-label="running your models anywhere permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>The main goal of MLEM is to provide you with a single tool that supports any kind
of model productionization scenario. MLEM groups those scenarios into three main
categories:</p>
<ul>
<li><strong>Use</strong> a model directly with MLEM.</li>
<li><strong>Export</strong> a model to a format that can be used by other tools.</li>
<li><strong>Deploy</strong> a model to a production environment or cloud provider.</li>
</ul>
<p>The first one allows you to import your model into a Python runtime, run predict
against some dataset directly in the command line, or serve the model with MLEM
from your CLI.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">$ python
<span class="token operator">>></span><span class="token operator">></span> <span class="token keyword">import</span> mlem
<span class="token operator">>></span><span class="token operator">></span> model <span class="token operator">=</span> mlem<span class="token punctuation">.</span>api<span class="token punctuation">.</span>load<span class="token punctuation">(</span><span class="token string">"mlem-model"</span><span class="token punctuation">)</span>
<span class="token operator">>></span><span class="token operator">></span> model<span class="token punctuation">.</span>predict<span class="token punctuation">(</span>test<span class="token punctuation">)</span>
<span class="token punctuation">[</span><span class="token punctuation">[</span><span class="token number">0.4</span><span class="token punctuation">,</span> <span class="token number">0.3</span><span class="token punctuation">,</span> <span class="token number">0.3</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token punctuation">[</span><span class="token number">0.2</span><span class="token punctuation">,</span> <span class="token number">0.5</span><span class="token punctuation">,</span> <span class="token number">0.3</span><span class="token punctuation">]</span><span class="token punctuation">]</span></code></pre></div>
<div class="gatsby-highlight" data-language="shell"><pre class="language-shell"><code class="language-shell">$ mlem apply mlem-model test.csv
<span class="token punctuation">[</span><span class="token punctuation">[</span><span class="token number">0.4</span>, <span class="token number">0.3</span>, <span class="token number">0.3</span><span class="token punctuation">]</span>, <span class="token punctuation">[</span><span class="token number">0.2</span>, <span class="token number">0.5</span>, <span class="token number">0.3</span><span class="token punctuation">]</span><span class="token punctuation">]</span></code></pre></div>
<div class="gatsby-highlight" data-language="shell"><pre class="language-shell"><code class="language-shell">$ mlem serve ml-model fastapi
⏳️ Loading model from ml-model.mlem
Starting fastapi server<span class="token punctuation">..</span>.
💅 Adding route <span class="token keyword">for</span> /predict
💅 Adding route <span class="token keyword">for</span> /predict_proba
Checkout openapi docs at <span class="token operator"><</span>http://0.0.0.0:8080/docs<span class="token operator">></span>
INFO: Started server process <span class="token punctuation">[</span><span class="token number">5750</span><span class="token punctuation">]</span>
INFO: Waiting <span class="token keyword">for</span> application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 <span class="token punctuation">(</span>Press CTRL+C to quit<span class="token punctuation">)</span></code></pre></div>
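<p>Once the server is running, any HTTP client can call the routes it lists. The sketch below builds a <code>POST</code> request for the <code>/predict</code> route using only the Python standard library. The helper name <code>build_predict_request</code> and, in particular, the JSON payload shape are assumptions made for illustration; the OpenAPI page at <code>/docs</code> is the place to confirm the schema your server expects.</p>

```python
import json
import urllib.request

# Address printed by `mlem serve` in the log above.
BASE_URL = "http://0.0.0.0:8080"


def build_predict_request(rows, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for the /predict route.

    ASSUMPTION: the body shape {"data": {"values": [...]}} is a guess for
    illustration; check the server's /docs page for the real schema.
    """
    body = json.dumps({"data": {"values": rows}}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage, with the server running:
#   request = build_predict_request([{"sepal length (cm)": 5.1}])
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response))
```

<p>Separating request construction from sending keeps the example easy to inspect without a live server.</p>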
<p>The second one allows you to export your models as a Python package, build a
Docker image, or export them to a special format (like <code>.onnx</code>, which is coming
soon).</p>
<div class="gatsby-highlight" data-language="shell"><pre class="language-shell"><code class="language-shell">$ mlem build mlem-model pip <span class="token parameter variable">-c</span> <span class="token assign-left variable">package_name</span><span class="token operator">=</span>mlem-translate <span class="token parameter variable">-c</span> <span class="token assign-left variable">target</span><span class="token operator">=</span>build/
⏳️ Loading model from ml-model.mlem
💼 Written <span class="token variable"><span class="token variable">`</span>ml-package<span class="token variable">`</span></span> package data to <span class="token variable"><span class="token variable">`</span>build/<span class="token variable">`</span></span>
$ tree build/
build
├── MANIFEST.in
├── ml-package
│   ├── __init__.py
│   ├── model
│   └── model.mlem
├── requirements.txt
└── setup.py</code></pre></div>
<p>The last one allows you to deploy models to deployment providers, such as Heroku
(with AWS Sagemaker and Kubernetes coming soon).</p>
<div class="gatsby-highlight" data-language="shell"><pre class="language-shell"><code class="language-shell">$ mlem deployment run myservice <span class="token parameter variable">-m</span> mlem-model <span class="token parameter variable">-t</span> staging <span class="token parameter variable">-c</span> <span class="token assign-left variable">app_name</span><span class="token operator">=</span>mlem-quick-start
⏳️ Loading deployment from my-service.mlem
🔗 Loading <span class="token function">link</span> to staging.mlem
🔗 Loading <span class="token function">link</span> to mlem-model.mlem
💾 Updating deployment at my-service.mlem
🛠 Creating <span class="token function">docker</span> image <span class="token keyword">for</span> heroku
🛠 Building MLEM wheel file<span class="token punctuation">..</span>.
💼 Adding model files<span class="token punctuation">..</span>.
🛠 Generating dockerfile<span class="token punctuation">..</span>.
💼 Adding sources<span class="token punctuation">..</span>.
💼 Generating requirements file<span class="token punctuation">..</span>.
🛠 Building <span class="token function">docker</span> image registry.heroku.com/mlem-quick-start/web<span class="token punctuation">..</span>.
✅ Built <span class="token function">docker</span> image registry.heroku.com/mlem-quick-start/web
🔼 Pushing image registry.heroku.com/mlem-quick-start/web to registry.heroku.com
✅ Pushed image registry.heroku.com/mlem-quick-start/web to registry.heroku.com
💾 Updating deployment at my-service.mlem
🛠 Releasing app mlem-quick-start formation
💾 Updating deployment at my-service.mlem
✅ Service mlem-quick-start is up. You can check it out at https://mlem-quick-start.herokuapp.com/</code></pre></div>
<p>Since MLEM is both a CLI-first and an API-first tool, you can productionize your
models just as easily with the Python API:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">$ python
<span class="token operator">>></span><span class="token operator">></span> <span class="token keyword">from</span> mlem<span class="token punctuation">.</span>api <span class="token keyword">import</span> serve<span class="token punctuation">,</span> build<span class="token punctuation">,</span> deploy</code></pre></div>
<h1 id="git-native-model-registry" style="position:relative;">Git-native model registry<a href="#git-native-model-registry" aria-label="git native model registry permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>MLEM is a core building block for a Git-based ML model registry, together with
other Iterative tools, like GTO and DVC.</p>
<p>ML model registries give your team key capabilities:</p>
<ul>
<li>Collect and organize model <a href="https://dvc.org/doc/use-cases/versioning-data-and-model-files" target="_blank" rel="nofollow noopener noreferrer">versions</a> from different sources effectively,
preserving their data provenance and lineage information.</li>
<li>Share metadata including <a href="https://dvc.org/doc/start/metrics-parameters-plots" target="_blank" rel="nofollow noopener noreferrer">metrics and plots</a> to help use and evaluate
models.</li>
<li>A standard interface to access all your ML artifacts, from early-stage
<a href="https://dvc.org/doc/user-guide/experiment-management" target="_blank" rel="nofollow noopener noreferrer">experiments</a> to production-ready models.</li>
<li>Deploy specific models on different environments (dev, shadow, prod, etc.)
without touching the applications that consume them.</li>
<li>For security, control who can manage models, and audit their usage trails.</li>
</ul>
<p>Many of these benefits are built into DVC: Your <a href="https://dvc.org/doc/start/data-pipelines" target="_blank" rel="nofollow noopener noreferrer">modeling process</a> and
<a href="https://dvc.org/doc/start/metrics-parameters-plots" target="_blank" rel="nofollow noopener noreferrer">performance data</a> become <strong>codified</strong> in Git-based <abbr>DVC
repositories</abbr>, making it possible to reproduce and manage models with
standard Git workflows (along with code). Large model files are stored
separately and efficiently, and can be pushed to <a href="https://dvc.org/doc/command-reference/remote" target="_blank" rel="nofollow noopener noreferrer">remote storage</a> — a scalable
access point for <a href="https://dvc.org/doc/start/data-and-model-access" target="_blank" rel="nofollow noopener noreferrer">sharing</a>.</p>
<p>To make a Git-native registry, one option is to use <a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a> (Git Tag Ops). It tags
ML model releases and promotions, and links them to artifacts in the repo using
versioned annotations. This creates abstractions for your models, which lets you
<strong>manage their lifecycle</strong> freely and directly from Git.</p>
<div class="gatsby-highlight" data-language="shell"><pre class="language-shell"><code class="language-shell">$ gto show
╒══════════════════════╤══════════╤════════╤═════════╕
│ name                 │ latest   │ <span class="token comment">#stage │ #prod   │</span>
╞══════════════════════╪══════════╪════════╪═════════╡
│ pet-face-recognition │ v3.1.0   │ -      │ v3.0.0  │
│ mlem-blep-classifier │ v0.4.1   │ v0.4.1 │ -       │
│ dog-bark-translator  │ v0.0.1   │ -      │ v0.0.1  │
╘══════════════════════╧══════════╧════════╧═════════╛
$ mlem apply dog-bark-translator ./short-dog-phrase.wav
🐶🚀🎉</code></pre></div>
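<p>To give a feel for how a registry can live entirely in Git tags, here is a small sketch that classifies tag names. It assumes two naming conventions, <code>name@version</code> for a registered version and <code>name#stage#counter</code> for a stage assignment. These patterns, and the function name <code>parse_tag</code>, are assumptions for illustration; check the GTO documentation for the exact format your version uses.</p>

```python
import re

# ASSUMED tag conventions (verify against the GTO docs for your version):
#   <artifact>@<version>          registers a version
#   <artifact>#<stage>#<counter>  assigns a stage to the tagged commit
VERSION_TAG = re.compile(r"^(?P<name>[^@#]+)@(?P<version>v\d+\.\d+\.\d+)$")
STAGE_TAG = re.compile(r"^(?P<name>[^@#]+)#(?P<stage>[^#]+)#(?P<counter>\d+)$")


def parse_tag(tag):
    """Classify a Git tag name; return None if it matches neither pattern."""
    match = VERSION_TAG.match(tag)
    if match:
        return {"kind": "version", **match.groupdict()}
    match = STAGE_TAG.match(tag)
    if match:
        info = match.groupdict()
        info["counter"] = int(info["counter"])
        return {"kind": "stage", **info}
    return None


for tag in ["dog-bark-translator@v0.0.1", "dog-bark-translator#prod#1", "v1.0.0"]:
    print(tag, "->", parse_tag(tag))
```

<p>Because everything is encoded in tag names, the registry state can be rebuilt from <code>git tag</code> output alone, without a separate database.</p>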
<p>For more information, visit our
<a href="https://iterative.ai/model-registry" target="_blank" rel="nofollow noopener noreferrer">model registry page</a>.</p>
<h1 id="what-next" style="position:relative;">What next?<a href="#what-next" aria-label="what next permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>⭐ <strong>Star <a href="https://github.com/iterative/mlem" target="_blank" rel="nofollow noopener noreferrer">MLEM on GitHub</a></strong> and let us know
what you think!</p>
<p><img src="https://dvc.org/2022-06-01/mlem-repo-umbrella-dog-6c0b38915bbeb06b0edc21acd71bb3b6.gif" alt="Umbrella dog" title="Machine Learning should be mlemming!"></p>
<p>Machine Learning should be mlemming! 🚀</p>
<p>Resources:</p>
<ul>
<li><a href="https://mlem.ai/doc" target="_blank" rel="nofollow noopener noreferrer">Documentation</a></li>
<li><a href="https://mlem.ai" target="_blank" rel="nofollow noopener noreferrer">MLEM website</a></li>
<li><a href="https://github.com/iterative/mlem" target="_blank" rel="nofollow noopener noreferrer">MLEM on GitHub</a></li>
<li><a href="https://iterative.ai/model-registry/" target="_blank" rel="nofollow noopener noreferrer">Building an ML model registry</a></li>
</ul>
<hr>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/aws-remotes-in-dvchttps://dvc.org/blog/aws-remotes-in-dvcTue, 31 May 2022 00:00:00 GMT<p>When you’re working on a data science project that has huge datasets, it’s
common to store them in cloud storage. You’ll also be working with different
versions of the same datasets to train a model, so it’s crucial to have a tool
that enables you to switch between datasets quickly and easily. That’s why we’re
going to do a quick walkthrough of how to set up a remote in an AWS S3 bucket
and handle data versioning with <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">DVC</a>.</p>
<p>We’ll start by creating a new S3 bucket in our AWS account, then we’ll show how
you can add DVC to your project. We’ll be working with
<a href="https://github.com/iterative/stale-model-example" target="_blank" rel="nofollow noopener noreferrer">this repo</a> if you want an
example to play with.</p>
<admon type="info">
<p>By the time you finish, you should be able to create this setup for any machine
learning project using an AWS remote.</p>
</admon>
<h2 id="set-up-an-aws-s3-bucket" style="position:relative;">Set up an AWS S3 bucket<a href="#set-up-an-aws-s3-bucket" aria-label="set up an aws s3 bucket permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Make sure that you already have an <a href="https://aws.amazon.com/" target="_blank" rel="nofollow noopener noreferrer">AWS account</a> and
log in. Search for <code>S3</code> and it should be the first service that appears.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/0d7844098f66ea6a742edb060adc4920/39600/finding_s3.png" alt="S3 service in AWS" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Once you’re on the S3 page, click the <code>Create Bucket</code> button and it will take
you to a page that looks like this. The bucket in this example is called
<code>updatedbikedata</code> because that is the data our demo repo works with. You can
leave the default settings in place or you can update them to fit the
functionality you need.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/fc60057d7f3d97a8faa8abbaa9ddda79/39600/create_bucket.png" alt="create an S3 bucket in AWS" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Once you’ve created the bucket, you should be redirected to the S3 dashboard and
see the success message and your new bucket.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b07616e56bf4dbe8f93680ebb5de0d22/39600/created_bucket.png" alt="newly created S3 bucket in AWS" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h3 id="get-your-credentials" style="position:relative;">Get your credentials<a href="#get-your-credentials" aria-label="get your credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Now that the S3 bucket is ready, we need the <code>access_key_id</code> and
<code>secret_access_key</code> from AWS in order to connect to our project. You can create
these keys in your Identity and Access Management settings. Go to your security
credentials and select the <code>Access keys</code> section. Then click the
<code>Create New Access Key</code> button. This will generate a new set of keys for you so
make sure you download this file to get your secret access key.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/60472bc0f92d5ea991eb531263245603/39600/make_credentials.png" alt="make AWS access credentials" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Once you’ve downloaded the credentials, you should see the access key ID in the
table. Note that you won’t be able to view your secret key again after this
point. If you lose it, you will need to create a new set of credentials.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2c89ef39ec30bdaf80ddea219fb9e433/39600/credentials.png" alt="successfully created AWS access credentials" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>That’s it for setting up your bucket and getting the credentials you need! Now
let’s add DVC to our demo repo and set up the remote.</p>
<h2 id="set-up-a-dvc-project" style="position:relative;">Set up a DVC project<a href="#set-up-a-dvc-project" aria-label="set up a dvc project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>First, add DVC as a requirement to your project with the following installation
command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> <span class="token string">'dvc[s3]'</span></span></code></pre></div>
<p>Then you can initialize DVC in your own project with the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc init</span></span></code></pre></div>
<p>This will add all of the DVC internals needed to start versioning your data and
tracking experiments. Now we need to set up the remote to connect our project
data stored in AWS to the DVC repo.</p>
<h3 id="create-a-default-remote" style="position:relative;">Create a default remote<a href="#create-a-default-remote" aria-label="create a default remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Now we can add a default remote to the project with the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">-d</span> bikes s3://updatedbikedata</span></code></pre></div>
<p>This creates a default remote called <code>bikes</code> that connects to the
<code>updatedbikedata</code> bucket we made earlier, which is where the training data for
the model will be stored.</p>
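<p>The command records the remote in the project's <code>.dvc/config</code> file. The sketch below parses a sample of what that file is assumed to look like after the command runs and looks up the default remote's URL. The sample text and the helper name <code>default_remote_url</code> are illustrative assumptions; the exact formatting may differ between DVC versions.</p>

```python
import configparser

# ASSUMED contents of .dvc/config after running
# `dvc remote add -d bikes s3://updatedbikedata`.
SAMPLE_CONFIG = """\
[core]
    remote = bikes
['remote "bikes"']
    url = s3://updatedbikedata
"""


def default_remote_url(config_text):
    """Return (remote name, URL) for the default remote in a DVC config."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    name = parser["core"]["remote"]
    # Remote sections are named like: 'remote "bikes"' (quotes included).
    section = "'remote \"{}\"'".format(name)
    return name, parser[section]["url"]


print(default_remote_url(SAMPLE_CONFIG))
```

<p>Since this file is committed to Git, collaborators who clone the repo get the same remote definition, but not your credentials.</p>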
<h3 id="add-aws-credentials" style="position:relative;">Add AWS credentials<a href="#add-aws-credentials" aria-label="add aws credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In order for DVC to be able to push and pull data from the remote, you need to
have valid AWS credentials.</p>
<p>By default, DVC authenticates using your AWS CLI configuration, if it has been
set. You can set it with the <code>aws configure</code> command like in this example:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">aws</span> configure
</span>AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]:
Default output format [None]:</code></pre></div>
<p>You can check out more details on this command
<a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html" target="_blank" rel="nofollow noopener noreferrer">here in the AWS docs</a>.</p>
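<p>If a push or pull fails, it can help to check where credentials are actually coming from. The sketch below looks in two common places: the <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code> environment variables, and the <code>~/.aws/credentials</code> file written by <code>aws configure</code>. The function name <code>find_aws_credentials</code> is hypothetical, and other credential sources (for example, IAM roles) are deliberately not covered.</p>

```python
import configparser
import os
from pathlib import Path


def find_aws_credentials(env=None, credentials_file=None, profile="default"):
    """Return 'environment', 'credentials file', or None if no keys are found.

    Only checks two common sources; this is a diagnostic sketch, not the
    full credential lookup performed by the AWS tooling.
    """
    env = os.environ if env is None else env
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"

    path = (
        Path(credentials_file)
        if credentials_file
        else Path.home() / ".aws" / "credentials"
    )
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file is silently ignored
    has_keys = parser.has_section(profile) and all(
        parser.has_option(profile, key)
        for key in ("aws_access_key_id", "aws_secret_access_key")
    )
    return "credentials file" if has_keys else None


# Usage: print(find_aws_credentials())
```

<p>The function only reports where keys were found and never prints the secret itself, so its output is safe to paste into a bug report.</p>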
<p>If you want to
<a href="https://dvc.org/doc/command-reference/remote/modify#amazon-s3" target="_blank" rel="nofollow noopener noreferrer">use a different authentication method</a>
or if you run into issues with the credentials, you can manually add them with
the following commands:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> <span class="token parameter variable">--local</span> bikes access_key_id <span class="token string">'mykey'</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> <span class="token parameter variable">--local</span> bikes secret_access_key <span class="token string">'mysecret'</span></span></code></pre></div>
<h3 id="push-and-pull-data-with-dvc" style="position:relative;">Push and pull data with DVC<a href="#push-and-pull-data-with-dvc" aria-label="push and pull data with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Now you can push data from your local machine to the AWS remote! First, add the
data you want DVC to track with the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc add</span> data</span></code></pre></div>
<p>This tells DVC to track the entire <code>data</code> directory, so it notices
whenever the contents change. Then you can push that data to your AWS remote with this
command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span></span></code></pre></div>
<p>Here's what the data might look like in your AWS bucket.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/8995e2ee92b7f960ca732dad0b0d802e/39600/aws_bucket.png" alt="data in AWS bucket" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Then if you move to a different machine or someone else needs to use that data,
it can be accessed by cloning or forking the project repo and running:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span></span></code></pre></div>
<p>This will get any data from your remote and download it to your local machine.</p>
<admon type="info">
<p>Authentication has to be set up locally on every machine that pushes or pulls
data.</p>
</admon>
<hr>
<p>That’s it! Now you can connect any DVC project to an AWS S3 bucket. If you run
into any issues, make sure to check that your credentials are valid, check if
your user has MFA enabled, and check that the user has the right level of
permissions.</p>https://dvc.org/blog/may-22-community-gemshttps://dvc.org/blog/may-22-community-gemsThu, 26 May 2022 00:00:00 GMT<h3 id="is-it-possible-to-export-a-plot-generated-using-dvc-plots-diff-head-main-to-vega-lite-for-use-in-cml" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/965911829538832435" target="_blank" rel="nofollow noopener noreferrer">Is it possible to export a plot generated using <code>dvc plots diff HEAD main</code> to vega-lite for use in CML?</a><a href="#is-it-possible-to-export-a-plot-generated-using-dvc-plots-diff-head-main-to-vega-lite-for-use-in-cml" aria-label="is it possible to export a plot generated using dvc plots diff head main to vega lite for use in cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for the awesome question @dominic!</p>
<p>You can use the <a href="https://dvc.org/doc/command-reference/plots/diff#--show-vega"><code>dvc plots diff --show-vega</code></a> command to export the plot to
vega-lite on a single graph. You'll need to run the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> HEAD main <span class="token parameter variable">--targets</span> prediction.json <span class="token parameter variable">--show-vega</span> <span class="token operator">></span> vega.json</span></code></pre></div>
<p>You can also include this plot in a comment with CML so that it appears on your
pull requests in GitHub.</p>
<h3 id="what-is-the-difference-between-dvc-pull-and-dvc-checkout" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/966739538888241192" target="_blank" rel="nofollow noopener noreferrer">What is the difference between <code>dvc pull</code> and <code>dvc checkout</code>?</a><a href="#what-is-the-difference-between-dvc-pull-and-dvc-checkout" aria-label="what is the difference between dvc pull and dvc checkout permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Great question @Derek!</p>
<p>Here are some explanations around how <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> and <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a> work.
They're comparable to <code>git pull</code> and <code>git checkout</code>.</p>
<ul>
<li><a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> downloads data from your remote storage to your local cache and syncs
it to your workspace</li>
<li><a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a> syncs data from your local cache to your workspace</li>
</ul>
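<p>To make the relationship concrete, here is a minimal shell sketch. It relies on the documented behavior that <code>dvc pull</code> effectively runs <code>dvc fetch</code> followed by <code>dvc checkout</code>. The function name <code>pull_in_two_steps</code> is made up for illustration and is not part of DVC.</p>

```shell
# Hypothetical helper (not part of DVC): reproduces `dvc pull` in two steps.
pull_in_two_steps() {
  # Step 1: download tracked data from remote storage into the local cache.
  dvc fetch "$@" || return
  # Step 2: sync the cached data into the workspace.
  dvc checkout "$@"
}

# Usage:
#   pull_in_two_steps            # everything tracked in the project
#   pull_in_two_steps data.dvc   # a single target
```

<p>Running only the second step is useful when the data is already in your local cache, for example after switching Git branches.</p>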
<h3 id="is-there-a-way-to-add-all-of-the-outs-of-a-foreach-job-to-the-deps-of-a-downstream-stage" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/967709548393672734" target="_blank" rel="nofollow noopener noreferrer">Is there a way to add all of the <code>outs</code> of a <code>foreach</code> job to the <code>deps</code> of a downstream stage?</a><a href="#is-there-a-way-to-add-all-of-the-outs-of-a-foreach-job-to-the-deps-of-a-downstream-stage" aria-label="is there a way to add all of the outs of a foreach job to the deps of a downstream stage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Very interesting question from @mathematiguy!</p>
<p>One way to do this is to have all <code>foreach</code> stages write out to different paths
within the same directory and then track the entire directory as a dependency of
your downstream stage.</p>
<p>Here's an example of how that might look in your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">cleanups</span><span class="token punctuation">:</span>
    <span class="token key atrule">foreach</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> raw1
      <span class="token punctuation">-</span> labels1
      <span class="token punctuation">-</span> raw2
    <span class="token key atrule">do</span><span class="token punctuation">:</span>
      <span class="token key atrule">cmd</span><span class="token punctuation">:</span> echo "$<span class="token punctuation">{</span>item<span class="token punctuation">}</span>" <span class="token punctuation">></span> "data/$<span class="token punctuation">{</span>item<span class="token punctuation">}</span>"
      <span class="token key atrule">outs</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> data/$<span class="token punctuation">{</span>item<span class="token punctuation">}</span>
  <span class="token key atrule">reduce</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> echo file <span class="token punctuation">></span> file
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> data
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> file</code></pre></div>
<h3 id="is-there-a-way-to-version-and-move-data-from-one-cloud-storage-to-another-with-dvc-remotes" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/968778284114538496" target="_blank" rel="nofollow noopener noreferrer">Is there a way to version and move data from one cloud storage to another with DVC remotes?</a><a href="#is-there-a-way-to-version-and-move-data-from-one-cloud-storage-to-another-with-dvc-remotes" aria-label="is there a way to version and move data from one cloud storage to another with dvc remotes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Wonderful question from @Hisham!</p>
<p>There are a couple of ways you can do this. One approach is to use
<a href="https://dvc.org/doc/command-reference/add#--to-remote"><code>dvc add --to-remote</code></a>.</p>
<p>The other approach is to use the
<a href="https://dvc.org/doc/command-reference/import-url#example-transfer-to-remote-storage" target="_blank" rel="nofollow noopener noreferrer"><code>import-url --to-remote</code></a>
functionality. The main difference between these approaches is that
<a href="https://dvc.org/doc/command-reference/import-url"><code>dvc import-url</code></a> has the added benefit of keeping a connection to the data
source so it can be updated later with <a href="https://dvc.org/doc/command-reference/update"><code>dvc update</code></a>.</p>
<p>You can see an example of how to do this in the docs. Just make sure that you
have your remotes set up!</p>
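<p>As a rough sketch of the second approach, the helpers below wrap the two commands involved. The function names and the URL, file, and remote names in the usage comments are placeholders. The <code>--to-remote</code> and <code>-r</code> options are the ones described in the command reference, so it is worth checking <code>dvc import-url --help</code> for your DVC version.</p>

```shell
# Hypothetical helpers (not part of DVC).

# Transfer a file from an external location straight to a DVC remote
# without keeping a local copy; DVC writes <out>.dvc to track it.
transfer_to_remote() {
  src_url="$1"; out="$2"; remote="$3"
  dvc import-url "$src_url" "$out" --to-remote -r "$remote"
}

# Later, re-check the source and transfer any new version to the remote.
refresh_on_remote() {
  out="$1"; remote="$2"
  dvc update "$out.dvc" --to-remote -r "$remote"
}

# Usage (placeholder names):
#   transfer_to_remote s3://old-bucket/data.csv data.csv new-storage
#   refresh_on_remote data.csv new-storage
```

<p>The generated <code>.dvc</code> file is what you commit to Git, so the version history lives in your repository while the data itself sits in the new remote.</p>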
<h3 id="if-im-using-feast-feature-store-is-it-possible-to-version-datasets-with-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/968899175561449532" target="_blank" rel="nofollow noopener noreferrer">If I'm using Feast feature store, is it possible to version datasets with DVC?</a><a href="#if-im-using-feast-feature-store-is-it-possible-to-version-datasets-with-dvc" aria-label="if im using feast feature store is it possible to version datasets with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is a good integration question from @Bernardo Galvao!</p>
<p>If you want to fetch historical features from the offline store to generate
training data, a typical pattern would be to write the script to do so and set
up a DVC pipeline stage to track that script and version the output file. This
is similar to how a lot of people use DVC alongside SQL databases.</p>
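<p>A minimal sketch of that pattern, assuming a script called <code>get_training_data.py</code> that queries the offline store and writes <code>data/training.parquet</code>. Both names, the stage name, and the helper function are hypothetical.</p>

```shell
# Hypothetical helper: registers a pipeline stage in dvc.yaml that tracks
# the feature-retrieval script and versions the file it writes.
add_feature_stage() {
  script="$1"; output="$2"
  dvc stage add -n fetch_features -d "$script" -o "$output" python "$script"
}

# Usage (placeholder names), followed by `dvc repro` to run the stage:
#   add_feature_stage get_training_data.py data/training.parquet
```

<p>DVC only sees the script and the output file, not the feature store itself, so if the stored features change while the script stays the same you would need to rerun the stage explicitly, for example with <code>dvc repro --force</code>.</p>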
<h3 id="how-can-i-run-a-dvc-pipeline-in-a-docker-container" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/969640280263389184" target="_blank" rel="nofollow noopener noreferrer">How can I run a DVC pipeline in a Docker container?</a><a href="#how-can-i-run-a-dvc-pipeline-in-a-docker-container" aria-label="how can i run a dvc pipeline in a docker container permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Nice question from @Anudeep!</p>
<p>Here's an example of a Dockerfile with a simple DVC setup.</p>
<div class="gatsby-highlight" data-language="docker"><pre class="language-docker"><code class="language-docker"><span class="token instruction"><span class="token keyword">FROM</span> ubuntu:latest</span>
<span class="token instruction"><span class="token keyword">RUN</span> apt-get update && apt-get install -y python-is-python3 python3-pip</span>
<span class="token instruction"><span class="token keyword">WORKDIR</span> /dvc_project</span>
<span class="token instruction"><span class="token keyword">COPY</span> . .</span>
<span class="token comment"># assuming your requirements, including dvc, are here</span>
<span class="token instruction"><span class="token keyword">RUN</span> pip install -r requirements.txt</span>
<span class="token instruction"><span class="token keyword">CMD</span> dvc pull && dvc exp run</span></code></pre></div>
<p>You would save this file and then run the following commands in your terminal.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">docker</span> build <span class="token parameter variable">-t</span> <span class="token string">"myproject-dvc-exp-run"</span> <span class="token builtin class-name">.</span>
</span><span class="token line"><span class="token input">$ </span><span class="token command">docker</span> run myproject-dvc-exp-run</span></code></pre></div>
<p>You could also use the <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> command or any of the other DVC commands.</p>
<h3 id="how-can-i-reset-a-repository-and-start-fresh-with-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/970344379938127892" target="_blank" rel="nofollow noopener noreferrer">How can I reset a repository and start fresh with DVC?</a><a href="#how-can-i-reset-a-repository-and-start-fresh-with-dvc" aria-label="how can i reset a repository and start fresh with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Nice question from @strickvl!</p>
<p>The best approach for resetting a repo is to run the <a href="https://dvc.org/doc/command-reference/destroy"><code>dvc destroy</code></a> command, which
removes all DVC files and internals from your repository.</p>
<h3 id="is-there-an-example-of-using-cml-with-gcp-that-can-be-used-as-a-reference" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/963512513452970086" target="_blank" rel="nofollow noopener noreferrer">Is there an example of using CML with GCP that can be used as a reference?</a><a href="#is-there-an-example-of-using-cml-with-gcp-that-can-be-used-as-a-reference" aria-label="is there an example of using cml with gcp that can be used as a reference permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Excellent question from @sabygo!</p>
<p>Here is a GitHub Actions snippet to get you started:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">jobs</span><span class="token punctuation">:</span>
  <span class="token key atrule">setup</span><span class="token punctuation">:</span>
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> ubuntu<span class="token punctuation">-</span>latest
    <span class="token key atrule">steps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Deploy runner
        <span class="token key atrule">env</span><span class="token punctuation">:</span>
          <span class="token key atrule">GOOGLE_APPLICATION_CREDENTIALS_DATA</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.GCP_CML_RUNNER_KEY <span class="token punctuation">}</span><span class="token punctuation">}</span>
        <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
          cml runner \
            --single \
            --labels=cml-gcp \
            --token=${{ secrets.GCP_SECRET }} \
            --cloud=gcp \
            --cloud-region=us-west \
            --cloud-type=e2-highcpu-2</span>
  <span class="token key atrule">test</span><span class="token punctuation">:</span>
    <span class="token key atrule">needs</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>setup<span class="token punctuation">]</span>
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">-</span>gcp<span class="token punctuation">]</span>
    <span class="token key atrule">steps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
      <span class="token comment"># - uses: iterative/setup-cml@v1</span>
      <span class="token punctuation">-</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
          echo "model training"</span></code></pre></div>
<h3 id="can-i-use-preemptive-instances-provided-by-gcp-as-a-cml-runner" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/964860322710192202" target="_blank" rel="nofollow noopener noreferrer">Can I use preemptive instances provided by GCP as a <code>cml-runner</code>?</a><a href="#can-i-use-preemptive-instances-provided-by-gcp-as-a-cml-runner" aria-label="can i use preemptive instances provided by gcp as a cml runner permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Good question from @Atsu!</p>
<p>Yes! You can use <code>cml runner --cloud-spot</code> to request a preemptible (spot) instance.</p>
<hr>
<p><img src="https://media.giphy.com/media/bg1MQ6IUVoVOM/giphy.gif" alt="We Did It Smiling GIF"></p>
<p>Our June Office Hours Meetup will be the launch party for our new MLOps
tool! Make sure you join us to find out what it is!
<a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/285789441/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a>
to stay up to date with specifics as we get closer to the event!</p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and
CML questions answered!</p>https://dvc.org/blog/local-experiments-to-cloud-with-tpi-dockerhttps://dvc.org/blog/local-experiments-to-cloud-with-tpi-dockerTue, 24 May 2022 00:00:00 GMT<p>We recently <a href="https://dvc.org/blog/local-experiments-to-cloud-with-tpi">published a tutorial</a> on using <a href="https://github.com/iterative/terraform-provider-iterative" target="_blank" rel="nofollow noopener noreferrer">Terraform Provider
Iterative (TPI)</a> to move a machine learning experiment from your local
computer to a more powerful cloud machine. We've covered how you can use
<a href="https://www.terraform.io" target="_blank" rel="nofollow noopener noreferrer">Terraform</a> & TPI to provision infrastructure, sync
data, and run training scripts. To simplify the setup, we used a pre-configured
<a href="https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#machine-image" target="_blank" rel="nofollow noopener noreferrer">Ubuntu/NVIDIA image</a>.
However, instead of using a pre-configured image, we can use custom
<a href="https://www.docker.com" target="_blank" rel="nofollow noopener noreferrer">Docker</a> images. These are often
<a href="https://aws.amazon.com/blogs/opensource/why-use-docker-containers-for-machine-learning-development/" target="_blank" rel="nofollow noopener noreferrer">recommended in machine learning</a>
as well as in traditional software development.</p>
<admon type="info">
<p>Using Docker to manage dependencies (e.g. Python packages) does not remove all
other setup requirements. You'll still need Docker itself installed, as well as
GPU runtime drivers if applicable. Happily, TPI sets up all of this by default.</p>
</admon>
<p>When confronted with cloud infrastructure and dependencies, people often think
"oh no, not again" (much
<a href="https://www.youtube.com/watch?v=THSY7-CxKnQ" target="_blank" rel="nofollow noopener noreferrer">like the petunias</a> in the cover
image). To solve this, separating dependencies into Docker images gives more
control over software versions, and also makes it painless to switch between
cloud providers — currently Amazon Web Services (AWS), Microsoft Azure, Google
Cloud Platform, and Kubernetes. Your Docker image is cloud provider-agnostic.
There are thousands of
<a href="https://hub.docker.com/" target="_blank" rel="nofollow noopener noreferrer">pre-defined Docker images online</a> too.</p>
<p>In this tutorial, we'll use an existing Docker image that comes with most of our
requirements already installed. We'll then add a few more dependencies on
top and run our training pipeline in the cloud as before!</p>
<h2 id="run-gpu-enabled-docker-containers" style="position:relative;">Run GPU-enabled Docker containers<a href="#run-gpu-enabled-docker-containers" aria-label="run gpu enabled docker containers permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<admon type="warn">
<p>If you haven't read the <a href="https://dvc.org/blog/local-experiments-to-cloud-with-tpi">previous tutorial</a>, you should check out
the basics there first. This includes how to let Terraform know about TPI, and
essential commands (<code>init</code>, <code>apply</code>, <code>refresh</code>, <code>show</code>, and <code>destroy</code>).</p>
</admon>
<p>The only modification from the <a href="https://dvc.org/blog/local-experiments-to-cloud-with-tpi">previous tutorial</a> is the script
part of the <code>main.tf</code> config file.</p>
<p>Let's say we've found a carefully prepared Docker image suitable for data
science and machine learning — in this case,
<a href="https://cml.dev/doc/self-hosted-runners#docker-images" target="_blank" rel="nofollow noopener noreferrer"><code>iterativeai/cml:0-dvc2-base1-gpu</code></a>.
This image comes loaded with Ubuntu 20.04, Python 3.8, NodeJS, CUDA 11.0.3,
CuDNN 8, Git, <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a>, <a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, and other
essentials for full-stack data science.</p>
<p>Our <code>script</code> block is now:</p>
<div class="gatsby-highlight" data-language="hcl"><pre class="language-hcl"><code class="language-hcl"><span class="token property">script</span> <span class="token punctuation">=</span> <span class="token heredoc string"><<-END
#!/bin/bash
docker run --gpus all -v "$PWD:/tpi" -w /tpi -e TF_CPP_MIN_LOG_LEVEL \
iterativeai/cml:0-dvc2-base1-gpu /bin/bash -c "
pip install -r requirements.txt tensorflow==2.8.0
python train.py --output results-gpu/metrics.json
"
END</span></code></pre></div>
<p>Yes, it's quite long for a one-liner. Let's look at the components:</p>
<ul>
<li><code>docker run</code>: Download the specified image, create a container from the image,
and run it.</li>
<li><code>--gpus all</code>: Expose GPUs to the container.</li>
<li><code>-v "$PWD:/tpi"</code>: Expose our current working directory (<code>$PWD</code>) within the
container (as path <code>/tpi</code>).</li>
<li><code>-w /tpi</code>: Set the working directory of the container (to be <code>/tpi</code>).</li>
<li><code>-e TF_CPP_MIN_LOG_LEVEL</code>: Expose the environment variable
<code>TF_CPP_MIN_LOG_LEVEL</code> to the container (in this case to control TensorFlow's
verbosity).</li>
<li><code>iterativeai/cml:0-dvc2-base1-gpu</code>: The image we want to download & run a
container from.</li>
<li><code>/bin/bash -c "pip install -r requirements.txt ... python train.py ..."</code>:
Commands to run within the container's working directory. In this case,
install the dependencies and run the training script.</li>
</ul>
<p>We can now call <code>terraform init</code>, <code>export TF_LOG_PROVIDER=INFO</code>, and
<code>terraform apply</code> to provision infrastructure, upload our data and code, set up
the cloud environment, and run the training process. If you'd like to tinker
with this example you can
<a href="https://github.com/iterative/blog-tpi-bees/tree/docker" target="_blank" rel="nofollow noopener noreferrer">find it on GitHub</a>.</p>
<admon type="tip">
<p>Don't forget to <code>terraform refresh && terraform show</code> to check the status, and
<code>terraform destroy</code> to download results & shut everything down.</p>
</admon>
<p>Now you know the basics of using convenient Docker images together with
<a href="https://github.com/iterative/terraform-provider-iterative" target="_blank" rel="nofollow noopener noreferrer">TPI</a> for provisioning your MLOps infrastructure!</p>
<admon type="tip">
<p>If you have a lot of custom dependencies that rarely change (e.g. a large
<code>requirements.txt</code> that is rarely updated), it's a good idea to build it into
your own custom Docker image. Let us know if you'd like a tutorial on this!</p>
</admon>https://dvc.org/blog/may-22-heartbeathttps://dvc.org/blog/may-22-heartbeatMon, 16 May 2022 00:00:00 GMT<h1 id="aiml-news" style="position:relative;">AI/ML News<a href="#aiml-news" aria-label="aiml news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 200px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9638ba164ff65ee833ba12eb47f5694d/0988f/chip-huyen.jpg" alt="Designing Machine Learning Systems" title="Designing Machine Learning Systems" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h2 id="chip-huyen-designing-machine-learning-systems" style="position:relative;">Chip Huyen: Designing Machine Learning Systems<a href="#chip-huyen-designing-machine-learning-systems" aria-label="chip huyen designing machine learning systems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/chiphuyen/" target="_blank" rel="nofollow noopener noreferrer"><strong>Chip Huyen</strong></a> just came out with a
new book with <a href="https://oreilly.com" target="_blank" rel="nofollow noopener noreferrer">O'Reilly</a> entitled
<a href="https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/" target="_blank" rel="nofollow noopener noreferrer">Designing Machine Learning Systems</a>.<br>
I'm not going to pontificate here; Chip Huyen wrote it, the reviews are glowing,
need I say more?</p>
<h2 id="jenny-abramov-an-agile-framework-for-ai-projects--development-qa-deployment-and-maintenance" style="position:relative;">Jenny Abramov: An Agile Framework for AI Projects — Development, QA, Deployment and Maintenance<a href="#jenny-abramov-an-agile-framework-for-ai-projects--development-qa-deployment-and-maintenance" aria-label="jenny abramov an agile framework for ai projects development qa deployment and maintenance permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/jennyabramov/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jenny Abramov</strong></a>
<a href="https://towardsdatascience.com/an-agile-framework-for-ai-projects-development-cbe115ba86a2" target="_blank" rel="nofollow noopener noreferrer">wrote a piece</a>
in <a href="https://towardsdatascience.com/" target="_blank" rel="nofollow noopener noreferrer">Towards Data Science</a> presenting an
"iterative-lifecycle framework" adapted to AI-centered software. She outlines
important considerations for working through the framework that depend on
your use case, data, and business problem.</p>
<p>She suggests using DVC for larger, more complex datasets and also discusses the
need for reproducibility in experimentation, which DVC can help you with
(<a href="https://dvc.org/blog/ml-experiment-versioning" target="_blank" rel="nofollow noopener noreferrer">see Technical Product Manager Dave Berenbaum’s post on experiment versioning</a>).</p>
<p>In addition, she discusses issues with quality assurance in deployment and the
maintenance of the system.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ade289bcbc573399c5dd5db19fc2d749/39600/jenny-abramov.png" alt="Jenny Abramov iterative-lifecycle framework" title="Jenny Abramov iterative-lifecycle framework" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Jenny Abramov's iterative-lifecycle framework
(<a href="https://towardsdatascience.com/an-agile-framework-for-ai-projects-development-cbe115ba86a2" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="mlops-guide-from-innoq" style="position:relative;">MLOps Guide from INNOQ<a href="#mlops-guide-from-innoq" aria-label="mlops guide from innoq permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/larysavisenger/" target="_blank" rel="nofollow noopener noreferrer"><strong>Dr. Larysa Visengeriyeva</strong></a>,
<a href="https://www.linkedin.com/in/anja-kammer-berlin/" target="_blank" rel="nofollow noopener noreferrer"><strong>Anja Kammer,</strong></a>
<a href="https://www.linkedin.com/in/isabel-b%C3%A4r-a89705213/" target="_blank" rel="nofollow noopener noreferrer"><strong>Isabel Bär,</strong></a>
<a href="https://www.linkedin.com/in/alexander-kniesz-656256197/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alexander Kniesz,</strong></a>
and <a href="https://www.linkedin.com/in/michael-ploed/" target="_blank" rel="nofollow noopener noreferrer"><strong>Michael Plöd</strong></a> of
<a href="https://www.innoq.com/en/" target="_blank" rel="nofollow noopener noreferrer"><strong>INNOQ</strong></a> (a software development, strategy, and
technology consultancy) created
<a href="https://ml-ops.org/content/mlops-principles" target="_blank" rel="nofollow noopener noreferrer">this</a> very thorough resource on
MLOps, going through all the principles and "iterative-incremental" steps of the
process (there's an iterative pattern here 😉). The authors cover Automation,
Continuous X (hello CML and TPI), Versioning (hello DVC!), Experiments Tracking
(noted DVC here because indeed DVC does experiment tracking too!), Testing,
Monitoring, the "ML Test Score" System, Reproducibility, Modularity, ML-based
Software Delivery Metrics, and MLOps Principles and Best Practices. Definitely a
good resource for MLOps, and filled with more resources as well.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/0eb187e7f2573e5f7cb8160fb74297e1/03346/innoq.jpg" alt="INNOQ MLOps Guide" title="INNOQ MLOps Guide" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>INNOQ
MLOps Guide (<a href="https://ml-ops.org/content/mlops-principles" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<p>Also interesting from INNOQ is their
<a href="https://www.innoq.com/en/artists/" target="_blank" rel="nofollow noopener noreferrer">Artist-in-residence program</a> created because
they "believe in the conscious reflection between technology and society" and
feel art is well suited for this reflection. See the work below by Studio Waltz
Binaire based on the question: What traces do we leave behind with technology?</p>
<p><img src="https://media.giphy.com/media/NxdrJ6a4IQKyW5gGjL/giphy.gif" alt="Waltz Binaire GIF"></p>
<p>(<a href="https://www.innoq.com/en/artists/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</p>
<h2 id="laszlo-sragner-linkedin-discussion-on-code-quality" style="position:relative;">Laszlo Sragner: LinkedIn discussion on Code Quality<a href="#laszlo-sragner-linkedin-discussion-on-code-quality" aria-label="laszlo sragner linkedin discussion on code quality permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/laszlosragner/?trk=public_post-embed_share-update_actor-text&originalSubdomain=uk" target="_blank" rel="nofollow noopener noreferrer"><strong>Laszlo Sragner</strong></a>,
a frequent contributor to the MLOps Community in general, often driving
discussions and helping others in the
<a href="https://mlops-community.slack.com/join/shared_invite/zt-178s99cyv-Q~whRpqbhgMTBrOcbjnDIQ#/shared-invite/email" target="_blank" rel="nofollow noopener noreferrer">MLOps Community Slack channel,</a>
posed an interesting point about code quality on LinkedIn. Join the discussion
and weigh in at this post:</p>
<div class="gatsby-resp-iframe-wrapper" style="padding-bottom: 158.73015873015873%; position: relative; height: 0; overflow: hidden; "> <iframe src="https://www.linkedin.com/embed/feed/update/urn:li:share:6931541880090324992" frameborder="0" allowfullscreen title="Embedded post" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> </div>
<h1 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="icymi-we-released-tpi-" style="position:relative;">ICYMI: We released TPI! 🎉<a href="#icymi-we-released-tpi-" aria-label="icymi we released tpi permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>On April 27th we released the latest addition to our tool ecosystem.</p>
<p><img src="https://media.giphy.com/media/ut7lqhIfOscbjuU6YQ/giphy.gif" alt="Celebrate GIF"></p>
<p><a href="https://tpi.cml.dev" target="_blank" rel="nofollow noopener noreferrer">Terraform Provider Iterative (TPI)</a> is a Terraform plugin
built with machine learning in mind. Full lifecycle management of computing
resources (including GPUs and respawning spot instances) from several cloud
vendors (AWS, Azure, GCP, K8s)… without needing to be a cloud expert.</p>
<ul>
<li>
<p><strong>Lower cost with spot recovery:</strong> transparent data checkpoint/restore &
auto-respawning of low-cost spot/preemptible instances</p>
</li>
<li>
<p><strong>No cloud vendor lock-in:</strong> switch between clouds with just one line thanks
to unified abstraction</p>
</li>
<li>
<p><strong>No waste:</strong> auto-cleanup unused resources (terminate compute instances upon
task completion/failure & remove storage upon download of results), pay only
for what you use</p>
</li>
<li>
<p><strong>Developer-first experience:</strong> one-command data sync & code execution with no
external server, making the cloud feel like a laptop</p>
</li>
<li>
<p>⭐️ <a href="https://tpi.cml.dev" target="_blank" rel="nofollow noopener noreferrer">Star the Repo</a></p>
</li>
<li>
<p>✍🏼 <a href="https://dvc.org/blog/terraform-provider" target="_blank" rel="nofollow noopener noreferrer">Read the release blog post</a></p>
</li>
<li>
<p>⚙️ Read:
<a href="https://dvc.org/blog/local-experiments-to-cloud-with-tpi" target="_blank" rel="nofollow noopener noreferrer">Moving Local Experiments to the Cloud with Terraform Provider Iterative (TPI) tutorial</a></p>
</li>
<li>
<p>🎥 <a href="https://www.youtube.com/watch?v=2fEgO8SazSE&t=2s" target="_blank" rel="nofollow noopener noreferrer">Watch the video</a></p>
</li>
<li>
<p>🪐
<a href="https://github.com/iterative/blog-tpi-jupyter" target="_blank" rel="nofollow noopener noreferrer">TPI with Jupyter Notebooks Repo</a></p>
</li>
</ul>
<p>Stay tuned for more tutorials and use cases to come!</p>
<p><img src="https://media.giphy.com/media/MrCYIN3x0SgdG/giphy.gif" alt="Tom Cruise GIF"></p>
<h2 id="mission-impossible---we-have-a-mission-statement" style="position:relative;">🚀<del>Mission Impossible</del> - We have a mission statement!<a href="#mission-impossible---we-have-a-mission-statement" aria-label="mission impossible we have a mission statement permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We did it! This year we surveyed the entire team to arrive at a mission
statement for Iterative. It was no small feat to decide on what it should be
given the early stage of our industry, the variety of our tools, and (always a
struggle) figuring out the best and most concise way to convey these ideas (you
know our penchant for abbreviations). But we persevered and succeeded. Behold
Iterative's new mission statement:</p>
<blockquote>
<p>We deliver the best developer experience for machine learning teams by
creating an ecosystem of open, modular ML tools.</p>
</blockquote>
<p>As always the door is open for your feedback on how we can serve your needs
better!</p>
<h2 id="odsc-east" style="position:relative;">ODSC East<a href="#odsc-east" aria-label="odsc east permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We attended our first post-pandemic, in-person conference in Boston last month.
It was awesome to be together as a team, see
<a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a>,
<a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a>, and
<a href="https://www.linkedin.com/in/alex000kim/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim</strong></a> in action, and talk to
attendees and other vendors at the conference. We are looking forward to
<a href="https://mlopsworld.com/" target="_blank" rel="nofollow noopener noreferrer">MLOps World</a> next month!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2dc474d99a4acbf874536ef6c50f8403/03346/odsc.jpg" alt="Iterative Team at ODSC East" title="Iterative Team at ODSC East" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Iterative team (left to right) - Mike Moynihan, me, Dave Berenbaum, Daniel
Barnes, (DeeVee), Rob de Wit, Milecia McGregor, Dmitry Petrov, Jervis Hui, Alex
Kim, Chaz Black</em></p>
<h2 id="-tons-of-new-content-on-the-blog" style="position:relative;">✍🏼 Tons of new content on the blog<a href="#-tons-of-new-content-on-the-blog" aria-label=" tons of new content on the blog permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Our team has been on fire creating content for you. 🔥 Don't miss the following:</p>
<ul>
<li>Need to get started with CML and AWS?
<a href="https://www.linkedin.com/in/rcdewit?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAAA5CEPkB9fI02IpClBKhRdq2brULPHMhmR8&lipi=urn%3Ali%3Apage%3Ad_flagship3_search_srp_all%3B9MrcxBhQSG6IKzSgJDyfQQ%3D%3D" target="_blank" rel="nofollow noopener noreferrer"><strong>Rob de Wit</strong></a>
shows you how to train and save your models with CML in a two-part series
using a
<a href="https://dvc.org/blog/CML-runners-saving-models-1" target="_blank" rel="nofollow noopener noreferrer">self-hosted AWS EC2 runner</a>
and
<a href="https://dvc.org/blog/CML-runners-saving-models-2" target="_blank" rel="nofollow noopener noreferrer">with CML and DVC on a dedicated AWS EC2 runner</a></li>
<li>The
<a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines" target="_blank" rel="nofollow noopener noreferrer">Part 1</a>,
<a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-2-local-experiments" target="_blank" rel="nofollow noopener noreferrer">Part 2</a>
and
<a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-3-remote-exp-ci-cd" target="_blank" rel="nofollow noopener noreferrer">Part 3</a>
tutorials of <a href="https://www.linkedin.com/in/alex000kim/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim's</strong></a>
End-to-End Computer Vision API project are out and filled with great learning!</li>
<li><a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a> brings the monthly
roundup of the Community's best technical questions in our latest
<a href="https://dvc.org/blog/april-22-community-gems" target="_blank" rel="nofollow noopener noreferrer">Community Gems.</a> 💎</li>
</ul>
<h2 id="-shiny-new-docs" style="position:relative;">✨ Shiny New Docs<a href="#-shiny-new-docs" aria-label=" shiny new docs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We have a <a href="https://dvc.org/doc/start/experiments/visualization" target="_blank" rel="nofollow noopener noreferrer">new doc page</a>
showcasing the new visualizations added to the
<a href="https://github.com/iterative/example-dvc-experiments" target="_blank" rel="nofollow noopener noreferrer">example-dvc-experiments repo</a>.<br>
Whether
you need plots from tabular data, user-generated plots, or plots
autogenerated from deep learning code, we've got you covered.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d74bcfcd5fcdcca1aa86f048846e2334/39600/dvc-visualization-doc.png" alt="DVC Visualization Doc" title="DVC Visualization Doc" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>DVC Visualization Doc
(<a href="https://dvc.org/doc/start/experiments/visualization" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="dmitry-petrov-on-tfir-about-terraform-provider-iterative-tpi" style="position:relative;">Dmitry Petrov on TFIR about Terraform Provider Iterative (TPI)<a href="#dmitry-petrov-on-tfir-about-terraform-provider-iterative-tpi" aria-label="dmitry petrov on tfir about terraform provider iterative tpi permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a> recently sat down with
<a href="https://twitter.com/SwapBhartiya" target="_blank" rel="nofollow noopener noreferrer"><strong>Swapnil Bhartiya</strong></a> of
<a href="https://www.tfir.io/" target="_blank" rel="nofollow noopener noreferrer">TFIR</a> to have a chat about TPI. Learn how to save your
team valuable resources in your machine learning projects with Terraform
Provider Iterative (TPI). You can watch the recording below.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/x-xiKzlQFjY?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="-join-our-release-party-meetup" style="position:relative;">🥳 Join our Release Party Meetup<a href="#-join-our-release-party-meetup" aria-label=" join our release party meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We have another tool ready to debut on May 24th. On the 25th we'd love to have
you join us for a Release Party Meetup. We will be celebrating the release of
the new addition to our open-source tool ecosystem and have a demo of said tool!
To join the fun,
<a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/285789441/" target="_blank" rel="nofollow noopener noreferrer">RSVP to the Meetup </a>
and mark your calendar!</p>
<section class="elp-content-holder">
<a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/285789441/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">New Tool Release Party</h4>
<div class="elp-description">Join us May 25th. RSVP for New Tool Release Party!</div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2022-05-16/release-party-meetup-c6da47fce4ec95bd0414fe06e5e45c0c.png" alt="New Tool Release Party">
</div>
</a>
</section>
<h2 id="new-hires" style="position:relative;">New hires<a href="#new-hires" aria-label="new hires permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href=""><strong>Wolmir Nemitz</strong></a> is our first team member from South America! We're getting
closer to covering all the continents on
<a href="https://iterative.ai/about" target="_blank" rel="nofollow noopener noreferrer">our remote team map</a>! From Brazil, Wolmir joins us
as an Engineer for the 🤫 team (you'll find out June 14th). Wolmir has four
dogs, two tortoises, and a budgie! 🦜</p>
<p><a href="https://www.linkedin.com/in/ufijuice/" target="_blank" rel="nofollow noopener noreferrer"><strong>Pavel Chekmaryov</strong></a> joins us in People
Operations, managing the hiring pipeline from Frankfurt, Germany, and soon from
Canada! He has spent the last eight years in startups, most recently at OccurAI,
reinventing recruitment in the deep-tech/ML field. We look forward to him
helping to grow our amazing team!</p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Even with our amazing new additions to the team, we're still hiring!
<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a>
to find details of all the positions and share with anyone you think may be
interested! 🚀</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b575ad2dddeaef4c8a1e475f80cc5ca2/03346/hiring.jpg" alt="Iterative.ai is Hiring" title="Iterative.ai is Hiring" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Iterative is Hiring
(<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h1 id="community-news" style="position:relative;">Community News<a href="#community-news" aria-label="community news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="yet-another-tool-comparison-imagine-that" style="position:relative;">Yet another tool comparison, imagine that!<a href="#yet-another-tool-comparison-imagine-that" aria-label="yet another tool comparison imagine that permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><img src="https://media.giphy.com/media/lWa7aAo62YZLwtk3nj/giphy.gif" alt="Cant Believe There You Are GIF"></p>
<p>So each month I tell you about yet another post to help you attempt to make
sense of the vast MLOps tool space. Well, this month is no different. I mean you
could be new here, right? 🤷🏽♀️ <a href="https://dolthub.com" target="_blank" rel="nofollow noopener noreferrer">DoltHub</a> tries to bring some
clarity
<a href="https://www.dolthub.com/blog/2022-04-27-data-version-control/" target="_blank" rel="nofollow noopener noreferrer">with this piece</a>
by comparing different data versioning tools and the intricacies of each. You do
your research. You know we're partial.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/927c16986aa3b3e5f268930bb780460d/39600/data-version-control.png" alt="Data Version Control tools" title="Data Version Control tools" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Data Version Control tools
(<a href="https://www.dolthub.com/blog/2022-04-27-data-version-control/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<p>I’m starting to wonder if all Data Science/AI teams need a role whose sole
responsibility is keeping up to date with all the new tooling and
changes/updates to existing tooling in the MLOps space, and working out what
might work best for the team. What should this position be called? The best
answer wins a DVC t-shirt. See
<a href="https://twitter.com/DVCorg/status/1526286089551433728?s=20&t=nV3FQAso441MtvrckYAOJA" target="_blank" rel="nofollow noopener noreferrer">this Twitter thread</a>
to answer. (Hint: Funny answers will likely win 😉). Deadline: May 31st. Pass it
around…</p>
<h2 id="andrey-cheptsov-notebooks-and-mlops-choose-one" style="position:relative;">Andrey Cheptsov: Notebooks and MLOps. Choose One.<a href="#andrey-cheptsov-notebooks-and-mlops-choose-one" aria-label="andrey cheptsov notebooks and mlops choose one permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/andrey-cheptsov/" target="_blank" rel="nofollow noopener noreferrer"><strong>Andrey Cheptsov</strong></a> writes
<a href="https://mlopsfluff.dstack.ai/p/notebooks-and-mlops-choose-one?s=r" target="_blank" rel="nofollow noopener noreferrer">a piece</a>
pointing out how Jupyter Notebooks, while rightfully loved in data science work,
fail pretty miserably in a production environment and the reliance on them can
cause bad habits. He notes that he's found:</p>
<blockquote>
<p>For any ML model, the time spent in a Jupyter notebook is inversely
proportional to its reproducibility. The reasons behind this rule are poor
modularity and reusability of the code in notebooks, and poor integration with
Git. - Andrey Cheptsov</p>
</blockquote>
<p>He advocates for training your models using Python scripts, Git, and CI/CD to
automatically shift your focus to creating reusable, testable code, and to use
tools like <a href="https://gradio.app/" target="_blank" rel="nofollow noopener noreferrer">Gradio</a> and <a href="https://streamlit.io/" target="_blank" rel="nofollow noopener noreferrer">Streamlit</a>
to provide the interactivity of Jupyter notebooks. Sounds like a promising idea.
💡</p>
<p><img src="https://media.giphy.com/media/qxtxlL4sFFle/giphy.gif" alt="Confused The Interview GIF"></p>
<h2 id="beyond-ml" style="position:relative;">Beyond ML<a href="#beyond-ml" aria-label="beyond ml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>As noted above in our shiny new mission statement, our focus is to make tools
for machine learning teams. It has however come to our attention that more and
more users are using our tools for non-ML use cases.</p>
<p><a href="https://drorspei.wordpress.com/about/" target="_blank" rel="nofollow noopener noreferrer"><strong>Dror Speiser</strong></a> writes about a non-ML
use case in
<a href="https://drorspei.wordpress.com/2021/09/15/a-new-recipe-for-reproducible-cloud-deployments/" target="_blank" rel="nofollow noopener noreferrer">A New Recipe for Idempotent Cloud Deployments</a>
in which he provides a tutorial for doing just that with DVC.</p>
<p>The benefits of the approach are:</p>
<blockquote>
<ol>
<li>Changing one artifact’s code does not force rebuilding other artifacts,
even if you’re building on a new VM every time.</li>
<li>Changing only the deployment script won’t build any artifacts at all.</li>
<li>You have an artifact repository that just works.</li>
<li>Your Git history contains the hashes of all built artifacts.</li>
<li>You can look up any artifact using its hash.</li>
</ol>
</blockquote>
<p>We have opened up a #beyond-ml channel in our
<a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord Server</a>. Do stop by and chat about alternate uses
for our tools!</p>
<h2 id="upcoming-events" style="position:relative;">Upcoming Events<a href="#upcoming-events" aria-label="upcoming events permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ul>
<li>📣 Our next in-person conference will be
<a href="https://mlopsworld.com/" target="_blank" rel="nofollow noopener noreferrer">MLOps World</a> from June 7-10 in Toronto! We look
forward to seeing Community members there!</li>
<li>📣 PyLadies Berlin is hosting <strong>Doreen</strong>, a data scientist working at
<a href="https://opinary.com/" target="_blank" rel="nofollow noopener noreferrer">Opinary</a>, who will be presenting "Reproducible Machine
Learning with DVC and Poetry" on May 17th.
<a href="https://www.meetup.com/PyLadies-Berlin/events/285313817/" target="_blank" rel="nofollow noopener noreferrer">Join the event here.</a></li>
<li>📣 <a href="https://www.linkedin.com/in/nicolas-eiris/" target="_blank" rel="nofollow noopener noreferrer"><strong>Nicolás Eiras</strong></a> will be
presenting "Data Versioning: Towards Reproducibility in Machine Learning" at
<a href="https://embeddedvisionsummit.com/2022/session/data-versioning-towards-reproducibility-in-machine-learning/" target="_blank" rel="nofollow noopener noreferrer">Embedded Vision Summit</a>
on May 18th in Santa Clara, California.</li>
<li>📣 <a href="https://www.meetup.com/PyData-MTL/" target="_blank" rel="nofollow noopener noreferrer">Montreal PyData</a> will host a
<a href="https://www.meetup.com/PyData-MTL/events/285894672/" target="_blank" rel="nofollow noopener noreferrer">Meetup</a> on June 16th
with two presentations, "Introduction to Trustworthy Machine Learning for the
Enterprise" by <a href="https://www.linkedin.com/in/mohamedleila/" target="_blank" rel="nofollow noopener noreferrer"><strong>Mohamed Leila</strong></a>,
ServiceNow and "ML in production in the video game industry: Ubisoft's use
case" by
<a href="https://www.linkedin.com/in/jeanmicheldaignan/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jean-Michel Daignan</strong></a>,
Ubisoft.</li>
</ul>
<h2 id="other-fun-stuff" style="position:relative;">Other Fun Stuff<a href="#other-fun-stuff" aria-label="other fun stuff permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ul>
<li><a href="https://github.com/gaocegege/awesome-open-source-mlops" target="_blank" rel="nofollow noopener noreferrer">New Awesome list</a></li>
<li><a href="https://www.udemy.com/course/dvc-and-git-for-data-science/" target="_blank" rel="nofollow noopener noreferrer">New Udemy Course including DVC</a>
(But don't forget <a href="https://learn.dvc.org" target="_blank" rel="nofollow noopener noreferrer">our online course!</a>)</li>
<li>Would you like to get some good practice in? Join this
<a href="https://www.the-odd-dataguy.com/2022/04/28/dvc_kaggle/" target="_blank" rel="nofollow noopener noreferrer">Kaggle competition</a>
created by
<a href="https://www.linkedin.com/in/jeanmicheldaignan/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jean-Michel Daignan</strong></a>
based on a previous competition from Petfinder.my with some really cute pet
images.</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1118538f2983017664997ed3c565e62d/39600/img_pawpularity.png" alt="DVC Kaggle Competition" title="DVC Kaggle Competition" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>DVC Kaggle Competition based on Petfinder.my
(<a href="https://www.the-odd-dataguy.com/2022/04/28/dvc_kaggle/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We love it when our Community does conference talks on our tools! 🥰</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">The <a href="https://twitter.com/EmbVisionSummit">@EmbVisionSummit</a> starts on Monday and our team is on its way!🚀<br><br>We’ve had our fair share of experience on edge devices. Nicolás and our CTO <a href="https://twitter.com/dekked_">@dekked_</a> will be there; come by to chat about our experiences.<br><br>Also, don't miss Nico's talk! May 18th - 2:05pm <a href="https://t.co/MfnEtOT29Y">https://t.co/MfnEtOT29Y</a> <a href="https://t.co/r9itWhVjis">pic.twitter.com/r9itWhVjis</a></p>— Tryolabs (@tryolabs) <a href="https://twitter.com/tryolabs/status/1525103969885888512">May 13, 2022</a></blockquote>
<p>This Heartbeat was brought to you by the song "Tarkus" from Emerson, Lake, and
Palmer which can be found on our
<a href="https://open.spotify.com/playlist/3eahsf3T9iEJkfWECC7VEp?si=cbcf1f9d3e424d62" target="_blank" rel="nofollow noopener noreferrer">MLOps Playlist,</a>
and the letters <strong>T, P, and I.</strong> 😉 See you next month!</p>
<hr>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/local-experiments-to-cloud-with-tpihttps://dvc.org/blog/local-experiments-to-cloud-with-tpiThu, 12 May 2022 00:00:00 GMT<p>There are many reasons you might train a machine learning model locally. Mainly,
it's quick & easy to set up a new project on a local machine. This is sufficient
for simple experiments (with reduced data subsets or small models) without
paying to rent heavy cloud compute resources. A local machine is also deeply
familiar — as opposed to the multitude of available cloud services, which can
be intimidating even with a decent background in DevOps.</p>
<p>Once you locally set up and iterate over your data & code enough, you may reach
a point where more powerful compute resources are needed to train a larger model
and/or use bigger datasets. In other words, you might have to switch from
experimenting locally to a cloud environment. If you find yourself in this
situation, this tutorial will help you easily provision cloud infrastructure
with Terraform and run your existing training script on it.</p>
<h2 id="getting-started" style="position:relative;">Getting Started<a href="#getting-started" aria-label="getting started permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This tutorial uses the
<a href="https://www.kaggle.com/jenny18/honey-bee-annotated-images" target="_blank" rel="nofollow noopener noreferrer">BeeImage Dataset</a>
which contains over 5,100 bee images annotated with location, date, time,
subspecies, health condition, caste, and pollen. Let's assume we've downloaded
the images, created a project, and trained a
<a href="https://en.wikipedia.org/wiki/Convolutional_neural_network" target="_blank" rel="nofollow noopener noreferrer">convolutional neural network</a>
model locally to classify different subspecies. If you want to follow along, you
can use your own data and training code, or clone <a href="https://github.com/iterative/blog-tpi-bees" target="_blank" rel="nofollow noopener noreferrer">the example
repository</a>.</p>
<p>How do we continue iterating on our model in the cloud? Can we run more epochs
overnight? Change some hyperparameters? Add more layers? The first question when
planning <em>The Big Move</em> is "what dependencies are needed to train this model in
a cloud environment?"</p>
<p>Some of the important puzzle pieces you already have locally:</p>
<ul>
<li>Your training code. It is likely that you have a
<a href="https://dvc.org/doc/start/data-pipelines" target="_blank" rel="nofollow noopener noreferrer">whole pipeline</a> with multiple
stages but for the sake of simplicity, this tutorial uses a single <code>train.py</code>
script.</li>
<li>Data.</li>
<li>Python environment with all required libraries.</li>
</ul>
<p>You will also need an account with your cloud provider of choice. In this
tutorial we'll be provisioning infrastructure on
<a href="https://aws.amazon.com/" target="_blank" rel="nofollow noopener noreferrer">Amazon Web Services (AWS)</a>. You can create an AWS
account yourself, or ask your DevOps team to provide you with one.</p>
<admon type="info">
<p>Make sure to insert
<a href="https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#amazon-web-services" target="_blank" rel="nofollow noopener noreferrer">authentication credentials</a>
into your system's environment variables (<code>AWS_ACCESS_KEY_ID</code> and
<code>AWS_SECRET_ACCESS_KEY</code>).</p>
</admon>
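<p>As a quick preflight, a small shell function can confirm that both variables
are set before any Terraform command runs. This is only a sketch: the function
name <code>check_aws_credentials</code> is made up for illustration and is not part of
TPI or the AWS tooling, and it only checks that the variables are non-empty, not
that the keys are valid.</p>

```shell
# Hypothetical helper: returns 0 when both AWS variables are set, 1 otherwise.
check_aws_credentials() {
  status=0
  if [ -z "${AWS_ACCESS_KEY_ID:-}" ]; then
    echo "missing: AWS_ACCESS_KEY_ID" >&2
    status=1
  fi
  if [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "missing: AWS_SECRET_ACCESS_KEY" >&2
    status=1
  fi
  return "$status"
}
```

<p>Calling <code>check_aws_credentials</code> at the top of a wrapper script and exiting
on a non-zero status avoids a failed provisioning attempt later on.</p>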
<p>We can now start the move with the help of <a href="https://github.com/iterative/terraform-provider-iterative" target="_blank" rel="nofollow noopener noreferrer">Terraform Provider Iterative
(TPI)</a>.</p>
<h2 id="what-is-terraform" style="position:relative;">What is Terraform?<a href="#what-is-terraform" aria-label="what is terraform permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<admon type="info">
<p><a href="https://www.terraform.io" target="_blank" rel="nofollow noopener noreferrer">Terraform</a> is an open-source infrastructure-as-code
tool that you should <a href="https://www.terraform.io/downloads" target="_blank" rel="nofollow noopener noreferrer">download and install</a>
for this tutorial.</p>
</admon>
<p>Terraform requires us to create a configuration file containing a declarative
description of the infrastructure we need. There's no need to read lots of cloud
documentation or write lots of commands. Instead, you describe what your
infrastructure should ultimately look like. Behind the scenes, Terraform will
figure out what needs to be done. If you've cloned the <a href="https://github.com/iterative/blog-tpi-bees" target="_blank" rel="nofollow noopener noreferrer">repository</a>,
the <code>main.tf</code> configuration file is in the project's root. We'll explain its
contents below.</p>
<h2 id="terraform-provider-iterative-tpi" style="position:relative;">Terraform Provider Iterative (TPI)<a href="#terraform-provider-iterative-tpi" aria-label="terraform provider iterative tpi permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Terraform can orchestrate a plethora of resources for you, but for the
majority of projects you only need a few. Instead of shipping plugins
(providers) for all these resources in one bundle, Terraform downloads
<a href="https://www.terraform.io/docs/extend/how-terraform-works.html" target="_blank" rel="nofollow noopener noreferrer"><em>providers</em></a>
whenever required.</p>
<p>For this tutorial we will only need <a href="https://github.com/iterative/terraform-provider-iterative" target="_blank" rel="nofollow noopener noreferrer">TPI</a>. It enables full lifecycle
management of computing resources from AWS, Microsoft Azure, Google Cloud
Platform, and Kubernetes. TPI provisions infrastructure, syncs data, and also
executes your scripts, all via a single configuration file. It has several
super neat features:</p>
<ul>
<li>The configuration for different cloud compute providers is nearly identical,
so you can easily migrate from one cloud provider to another.</li>
<li>It syncs data to and from the remote cloud and your local machine.</li>
<li>It allows you to use low-cost spot instances, and even automatically respawns
interrupted instances, restoring working directories/data and resuming running
tasks in the cloud even if you are offline.</li>
<li>Once your training is complete, the remote resources will be terminated,
avoiding unused machines quietly ramping up costs.</li>
</ul>
<p>To start using TPI we need to let Terraform know about it by writing this in our
<code>main.tf</code>:</p>
<div class="gatsby-highlight" data-language="hcl"><pre class="language-hcl"><code class="language-hcl"><span class="token keyword">terraform</span> <span class="token punctuation">{</span>
<span class="token keyword">required_providers</span> <span class="token punctuation">{</span> <span class="token property">iterative</span> <span class="token punctuation">=</span> <span class="token punctuation">{</span> <span class="token property">source</span> <span class="token punctuation">=</span> <span class="token string">"iterative/iterative"</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span>
<span class="token punctuation">}</span>
<span class="token keyword">provider<span class="token type variable"> "iterative" </span></span><span class="token punctuation">{</span><span class="token punctuation">}</span></code></pre></div>
<p>Once we describe what providers we need, run</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">terraform</span> init</span></code></pre></div>
<admon type="info">
<p>If you have cloned the example repository, you should run this command before
doing anything else. This will initialize your working directory and download
the required provider(s).</p>
</admon>
<admon type="tip">
<p>It's probably also a good idea to set the logging level to see helpful info on
progress:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">export</span> <span class="token assign-left variable">TF_LOG_PROVIDER</span><span class="token operator">=</span>INFO</span></code></pre></div>
</admon>
<h2 id="configuring-iterative_task" style="position:relative;">Configuring <code>iterative_task</code><a href="#configuring-iterative_task" aria-label="configuring iterative_task permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>TPI offers a single resource called <code>iterative_task</code> that we'll need to
configure. This resource will:</p>
<ol>
<li>Create cloud resources (storage, machines) for the task.</li>
<li>If specified, upload a local working directory to the cloud storage.</li>
<li>Run the given script in the cloud until completion, error, or timeout.</li>
<li>If specified, download output results.</li>
<li>Automatically terminate compute resources upon task completion.</li>
</ol>
<p>This is exactly what we need to run a model training process! Let's see the
<code>iterative_task</code> in the <code>main.tf</code> file before delving into the details:</p>
<div class="gatsby-highlight" data-language="hcl"><pre class="language-hcl"><code class="language-hcl"><span class="token keyword">terraform</span> <span class="token punctuation">{</span>
<span class="token keyword">required_providers</span> <span class="token punctuation">{</span> <span class="token property">iterative</span> <span class="token punctuation">=</span> <span class="token punctuation">{</span> <span class="token property">source</span> <span class="token punctuation">=</span> <span class="token string">"iterative/iterative"</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span>
<span class="token punctuation">}</span>
<span class="token keyword">provider<span class="token type variable"> "iterative" </span></span><span class="token punctuation">{</span><span class="token punctuation">}</span>
<span class="token keyword">resource <span class="token type variable">"iterative_task"</span></span> <span class="token string">"example-basic"</span> <span class="token punctuation">{</span>
<span class="token property">cloud</span> <span class="token punctuation">=</span> <span class="token string">"aws"</span> <span class="token comment"># or any of: gcp, az, k8s</span>
<span class="token property">machine</span> <span class="token punctuation">=</span> <span class="token string">"m"</span> <span class="token comment"># medium. Or any of: l, xl, m+k80, xl+v100, ...</span>
<span class="token property">spot</span> <span class="token punctuation">=</span> <span class="token number">0</span> <span class="token comment"># auto-price. Default -1 to disable, or >0 for hourly USD limit</span>
<span class="token property">timeout</span> <span class="token punctuation">=</span> <span class="token number">24</span>*<span class="token number">60</span>*<span class="token number">60</span> <span class="token comment"># 24h</span>
<span class="token property">image</span> <span class="token punctuation">=</span> <span class="token string">"ubuntu"</span>
<span class="token keyword">storage</span> <span class="token punctuation">{</span>
<span class="token property">workdir</span> <span class="token punctuation">=</span> <span class="token string">"src"</span>
<span class="token property">output</span> <span class="token punctuation">=</span> <span class="token string">"results-basic"</span>
<span class="token punctuation">}</span>
<span class="token property">environment</span> <span class="token punctuation">=</span> <span class="token punctuation">{</span> <span class="token property">TF_CPP_MIN_LOG_LEVEL</span> <span class="token punctuation">=</span> <span class="token string">"1"</span> <span class="token punctuation">}</span>
<span class="token property">script</span> <span class="token punctuation">=</span> <span class="token heredoc string"><<-END
#!/bin/bash
sudo apt-get update -q
sudo apt-get install -yq python3-pip
pip3 install -r requirements.txt tensorflow-cpu==2.8.0
python3 train.py --output results-basic/metrics.json
END</span>
<span class="token punctuation">}</span></code></pre></div>
<p>Every Terraform resource needs a name; here it's <code>example-basic</code>. This name is
only used within the configuration file and it can be whatever you want. Inside
of the resource block, we specify some arguments:</p>
<ul>
<li><em>cloud</em> (<strong>required</strong>): cloud provider to run the task on. This can be <code>aws</code>,
<code>gcp</code>, <code>az</code>, or <code>k8s</code>.</li>
<li><em>machine</em>: if you know the exact kind of machine that you'd like to use, you
can specify it here. Alternatively,
<a href="https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#machine-type" target="_blank" rel="nofollow noopener noreferrer">TPI offers some common machine types</a>
which are roughly the same for all supported clouds. For example, <code>m+t4</code> means
"Medium, with (at least) 4 CPU cores, 16 GB RAM, and 1 NVIDIA Tesla T4 GPU
device".</li>
<li><em>spot</em>: set the
<a href="https://aws.amazon.com/ec2/spot/pricing/" target="_blank" rel="nofollow noopener noreferrer">spot instance price</a>. Here we use
<code>0</code> for automatic pricing, which should keep costs down. Alternatively you can
specify a positive number to set a maximum bidding price in USD per hour, or
<code>-1</code> to use on-demand pricing.</li>
<li><em>timeout</em>: maximum time to run before the instance is force-terminated. This
prevents forgotten long-running instances draining your budget.</li>
<li><em>image</em>: the machine image to use (in our case, Ubuntu 20.04 LTS).</li>
<li><em>workdir</em>: a directory on your local machine, relative to your project folder,
which you would like to upload to the remote machine. This way you can share
your whole project or parts of it with a remote machine.</li>
<li><em>output</em>: a directory <strong>relative to <code>workdir</code></strong> to download after the task is
complete.</li>
<li><em>script</em> (<strong>required</strong>): the commands to run in <code>workdir</code> on the provisioned
cloud instance. This is where TPI's magic happens.</li>
</ul>
<admon type="tip">
<p>See the
<a href="https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#argument-reference" target="_blank" rel="nofollow noopener noreferrer">resource arguments documentation</a>
for a full list.</p>
</admon>
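<p>The three kinds of <code>spot</code> value described above can be summarized in a few
lines of shell. This is an illustrative sketch only: <code>describe_spot</code> is a
hypothetical helper rather than a TPI or Terraform command, and <code>0.2</code> is an
arbitrary example limit.</p>

```shell
# Hypothetical helper: explain how a `spot` value is interpreted,
# following the argument list in this tutorial.
describe_spot() {
  awk -v v="$1" 'BEGIN {
    if (v == -1) { print "on-demand pricing (spot disabled)" }
    else if (v == 0) { print "spot instance, automatic pricing" }
    else if (v > 0) { printf "spot instance, maximum bid %s USD/hour\n", v }
    else { print "not described in this tutorial" }
  }'
}

describe_spot -1    # the default
describe_spot 0     # the value used in main.tf
describe_spot 0.2   # arbitrary example of an hourly limit
```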
<admon type="warn">
<p>Keep in mind
<a href="https://aws.amazon.com/ec2/pricing/" target="_blank" rel="nofollow noopener noreferrer">the running costs of AWS EC2 instances</a>.
The <code>machine</code> used in the example above is not included in the free tier and
will incur charges. Using TPI's <code>spot</code> pricing will keep costs to a minimum
(roughly $0.15/hour for <code>m+t4</code> on AWS), but not eliminate them entirely.</p>
</admon>
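<p>To get a feel for the worst case, you can multiply the <code>timeout</code> by the
hourly price. The sketch below assumes the rough $0.15/hour figure quoted above
and a task that runs for the full 24 hours; real spot prices fluctuate, and
storage and data transfer charges are not included.</p>

```shell
# Rough upper bound on compute cost if the task runs until the timeout.
timeout_seconds=$((24 * 60 * 60))   # same value as `timeout` in main.tf
hourly_usd="0.15"                   # approximate figure from this tutorial, not a live price

# The shell has no floating-point arithmetic, so let awk do the multiplication.
max_cost=$(awk -v s="$timeout_seconds" -v p="$hourly_usd" \
  'BEGIN { printf "%.2f", s / 3600 * p }')

echo "timeout: ${timeout_seconds}s, worst-case compute cost: \$${max_cost}"
```

<p>With these assumptions the bound works out to a few dollars, which is why a
sensible <code>timeout</code> is a cheap safeguard.</p>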
<p>In the simplest scenario, all we need to do on a new machine to run the training
<code>script</code> is to set up the Python environment with required libraries. If you
simply want to train your model on a machine with more memory, this may be
enough. However, if you want your training code to leverage GPUs, we can make a
few small tweaks:</p>
<h2 id="training-with-gpu" style="position:relative;">Training with GPU<a href="#training-with-gpu" aria-label="training with gpu permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>There are several ways you can leverage GPU devices on a remote machine. You can
install all the required drivers and dependencies "manually" via a script, use
an existing Docker image, build your own, or just use the convenient
<code>nvidia</code> image, which comes pre-packaged with CUDA 11.3 GPU drivers.</p>
<div class="gatsby-highlight" data-language="hcl"><pre class="language-hcl"><code class="language-hcl"><span class="token keyword">terraform</span> <span class="token punctuation">{</span>
<span class="token keyword">required_providers</span> <span class="token punctuation">{</span> <span class="token property">iterative</span> <span class="token punctuation">=</span> <span class="token punctuation">{</span> <span class="token property">source</span> <span class="token punctuation">=</span> <span class="token string">"iterative/iterative"</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span>
<span class="token punctuation">}</span>
<span class="token keyword">provider<span class="token type variable"> "iterative" </span></span><span class="token punctuation">{</span><span class="token punctuation">}</span>
<span class="token keyword">resource <span class="token type variable">"iterative_task"</span></span> <span class="token string">"example-gpu"</span> <span class="token punctuation">{</span>
<span class="token property">cloud</span> <span class="token punctuation">=</span> <span class="token string">"aws"</span>
<span class="token property">machine</span> <span class="token punctuation">=</span> <span class="token string">"m+t4"</span> <span class="token comment"># 4 CPUs and an NVIDIA Tesla T4 GPU</span>
<span class="token property">spot</span> <span class="token punctuation">=</span> <span class="token number">0</span>
<span class="token property">timeout</span> <span class="token punctuation">=</span> <span class="token number">24</span>*<span class="token number">60</span>*<span class="token number">60</span>
<span class="token property">image</span> <span class="token punctuation">=</span> <span class="token string">"nvidia"</span> <span class="token comment"># has CUDA GPU drivers</span>
<span class="token keyword">storage</span> <span class="token punctuation">{</span>
<span class="token property">workdir</span> <span class="token punctuation">=</span> <span class="token string">"src"</span>
<span class="token property">output</span> <span class="token punctuation">=</span> <span class="token string">"results-gpu"</span>
<span class="token punctuation">}</span>
<span class="token property">environment</span> <span class="token punctuation">=</span> <span class="token punctuation">{</span> <span class="token property">TF_CPP_MIN_LOG_LEVEL</span> <span class="token punctuation">=</span> <span class="token string">"1"</span> <span class="token punctuation">}</span>
<span class="token property">script</span> <span class="token punctuation">=</span> <span class="token heredoc string"><<-END
#!/bin/bash
sudo apt-get update -q
sudo apt-get install -yq python3-pip
pip3 install -r requirements.txt tensorflow==2.8.0
python3 train.py --output results-gpu/metrics.json
END</span>
<span class="token punctuation">}</span></code></pre></div>
<h2 id="ready-set-apply" style="position:relative;">Ready… Set… Apply!<a href="#ready-set-apply" aria-label="ready set apply permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Whether you want to go with the basic example, or the GPU-enabled training, you
can run:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">terraform</span> apply</span></code></pre></div>
<p>to review what steps Terraform is going to take to provision your desired
infrastructure. Don't worry, nothing is provisioned until you confirm, so this
is a good way to check for potential issues in the configuration. Type
<code>yes</code> to proceed.</p>
<p>At this point you can go offline, get a cup of your preferred beverage, and let
TPI work its magic together with Terraform. They will allocate a remote machine
for you, upload your data and script, and run your code. Once the script
finishes, the machine will be terminated.</p>
<p>You can monitor what's going on at any point by running:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">terraform</span> refresh
</span><span class="token line"><span class="token input">$ </span><span class="token command">terraform</span> show</span></code></pre></div>
<p>This will print the logs and the script's output. Once you see that the task has
successfully finished, run:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">terraform</span> destroy</span></code></pre></div>
<p>to sync back your shared files and tear down all remote objects managed by your
configuration. If you output results (e.g. <code>results-gpu/metrics.json</code>), they'll
be synced back to your local machine.</p>
<p>Now if you want to try another experiment, you can change your code, run
<code>terraform apply</code> again, and when the training is finished, commit your code
together with the updated results. This can help you move from prototyping
locally to leveraging more powerful cloud instances without the hassle of a full
MLOps setup. At the same time, once you're ready to start working on your
<a href="https://dvc.org/doc/use-cases/ci-cd-for-machine-learning" target="_blank" rel="nofollow noopener noreferrer">production pipelines and CI/CD</a>,
this <code>main.tf</code> codification should also make the transition smoother.</p>
<p>In this tutorial we covered the simplest example with no GPU, and one with GPUs.
In many cases, deploying your pipelines would be easier with your own Docker
image (both for prototyping and for production) and CI/CD workflows. If you'd
like to learn how to create your own Docker images and use them with TPI, see
<a href="https://dvc.org/blog/local-experiments-to-cloud-with-tpi-docker">part 2</a> of this blog post!</p>https://dvc.org/blog/end-to-end-computer-vision-api-part-3-remote-exp-ci-cdhttps://dvc.org/blog/end-to-end-computer-vision-api-part-3-remote-exp-ci-cdMon, 09 May 2022 00:00:00 GMT<h3 id="leveraging-cloud-resources-with-cicd-and-cml" style="position:relative;">Leveraging Cloud Resources with CI/CD and CML<a href="#leveraging-cloud-resources-with-cicd-and-cml" aria-label="leveraging cloud resources with cicd and cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If you use the <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML library</a> in combination with CI/CD tools
like GitHub Actions or GitLab CI/CD, you can quickly and easily:</p>
<ol>
<li>provision a powerful virtual machine (VM) in the cloud, since training Computer
Vision (CV) models often requires powerful GPUs that are rarely available on
local machines</li>
<li>submit your ML training job to it</li>
<li>save the results (metrics, models and other training artifacts)</li>
<li>automatically shut down the VM without having to worry about excessive cloud
bills</li>
</ol>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 460px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/300c88b3b1b5f65753629d661cc916e5/39600/cicd4ml.png" alt="Continuous Integration and Deployment for Machine Learning" title="Continuous Integration and Deployment for Machine Learning" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Continuous Integration and Deployment for Machine Learning</em></p>
<p>We've configured three
<a href="https://github.com/iterative/magnetic-tiles-defect/tree/main/.github/workflows" target="_blank" rel="nofollow noopener noreferrer">workflow files</a>
for GitHub Actions, each of which corresponds to a particular stage of the
project's lifecycle:</p>
<h4 id="1-workflow-for-experimentation-and-hyperparameter-tuning" style="position:relative;">1. <a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/.github/workflows/1-experiment.yaml" target="_blank" rel="nofollow noopener noreferrer">Workflow for experimentation and hyperparameter tuning</a><a href="#1-workflow-for-experimentation-and-hyperparameter-tuning" aria-label="1 workflow for experimentation and hyperparameter tuning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h4>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 400px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/361303930f78e3aafca3884430da2e6d/39600/workflow_exp.png" alt="Workflow for experimentation and hyperparameter tuning" title="Workflow for experimentation and hyperparameter tuning" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Workflow for experimentation and hyperparameter tuning</em> In this stage, we'll
create an experiment branch so that we can experiment with data preprocessing,
change model architecture, tune hyperparameters, etc. Once we think our
experiment is ready to be run, we'll push our changes to a remote repository (in
this case, GitHub). This push will trigger a CI/CD job in GitHub Actions, which
in turn will:</p>
<p>a) provision an EC2 virtual machine with a GPU in AWS:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Deploy runner on AWS EC2
  <span class="token key atrule">env</span><span class="token punctuation">:</span>
    <span class="token key atrule">REPO_TOKEN</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span>
  <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
    cml runner \
      --cloud=aws \
      --cloud-region=us-east-1 \
      --cloud-type=g4dn.xlarge \
      --labels=cml-runner</span></code></pre></div>
<p>b) deploy our experiment branch to a Docker container on this machine:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train-model</span><span class="token punctuation">:</span>
  <span class="token key atrule">needs</span><span class="token punctuation">:</span> deploy<span class="token punctuation">-</span>runner
  <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">-</span>runner<span class="token punctuation">]</span>
  <span class="token key atrule">container</span><span class="token punctuation">:</span>
    <span class="token key atrule">image</span><span class="token punctuation">:</span> iterativeai/cml<span class="token punctuation">:</span>0<span class="token punctuation">-</span>dvc2<span class="token punctuation">-</span>base1
    <span class="token key atrule">options</span><span class="token punctuation">:</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>gpus all
  <span class="token key atrule">environment</span><span class="token punctuation">:</span> cloud
  <span class="token key atrule">permissions</span><span class="token punctuation">:</span>
    <span class="token key atrule">contents</span><span class="token punctuation">:</span> read
    <span class="token key atrule">id-token</span><span class="token punctuation">:</span> write
  <span class="token key atrule">steps</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2</code></pre></div>
<p>c) rerun the entire DVC pipeline and push metrics back to GitHub:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> dvc<span class="token punctuation">-</span>repro<span class="token punctuation">-</span>cml
  <span class="token key atrule">env</span><span class="token punctuation">:</span>
    <span class="token key atrule">REPO_TOKEN</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span>
  <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
    # Install dependencies
    pipenv install --skip-lock
    pipenv run dvc pull
    pipenv run dvc exp run
    pipenv run dvc push</span></code></pre></div>
<p>d) open a pull request and post a report to it that contains a table with
metrics and model outputs on a few test images:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token comment"># Open a pull request</span>
cml <span class="token function">pr</span> dvc.lock metrics.json training_metrics.json training_metrics_dvc_plots/**
<span class="token comment"># Create CML report</span>
<span class="token builtin class-name">echo</span> <span class="token string">"## Metrics"</span> <span class="token operator">></span> report.md
pipenv run dvc metrics show <span class="token parameter variable">--md</span> <span class="token operator">>></span> report.md
<span class="token builtin class-name">echo</span> <span class="token string">"## A few random test images"</span> <span class="token operator">>></span> report.md
<span class="token keyword">for</span> <span class="token for-or-select variable">file</span> <span class="token keyword">in</span> <span class="token variable"><span class="token variable">$(</span><span class="token function">ls</span> data/test_preds/ <span class="token operator">|</span> <span class="token function">sort</span> <span class="token parameter variable">-R</span> <span class="token operator">|</span> <span class="token function">tail</span> <span class="token parameter variable">-20</span><span class="token variable">)</span></span><span class="token punctuation">;</span> <span class="token keyword">do</span>
  cml publish data/test_preds/<span class="token variable">$file</span> <span class="token parameter variable">--md</span> <span class="token operator">>></span> report.md
<span class="token keyword">done</span>
cml send-comment <span class="token parameter variable">--pr</span> <span class="token parameter variable">--update</span> report.md</code></pre></div>
<p>The report structure is fully customizable. Below is an example of what the PR
and the CML report would look like in this case. The test images show (from left
to right) input images, ground truth masks and prediction masks.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1df47be6188799d34f3c0cc7678b0be4/39600/pr_cml_report.png" alt="PR and CML report" title="PR and CML report" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>PR
and CML report</em></p>
<p>At this point, we can assess the results in Iterative Studio and GitHub and
decide whether we want to accept the PR or keep experimenting.</p>
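<p>When deciding whether to accept the PR, it can help to make the comparison between
the metrics on the base branch and those of the experiment explicit. Below is a minimal sketch of such a
check. The function name <code>metrics_improved</code>, the metric names
<code>dice</code> and <code>valid_loss</code>, and their values are hypothetical and only
stand in for whatever your own <code>metrics.json</code> contains.</p>

```python
import json


def metrics_improved(baseline, candidate, higher_is_better):
    """Return True if no tracked metric got worse and at least one improved.

    `higher_is_better` maps a metric name to True (score-like metrics)
    or False (loss-like metrics). Metrics missing from either run are skipped.
    """
    improved = False
    for name, bigger_wins in higher_is_better.items():
        if name not in baseline or name not in candidate:
            continue
        delta = candidate[name] - baseline[name]
        if not bigger_wins:
            # For losses, a decrease counts as progress.
            delta = -delta
        if delta < 0:
            return False
        if delta > 0:
            improved = True
    return improved


# Hypothetical values standing in for two versions of metrics.json.
main_metrics = json.loads('{"dice": 0.81, "valid_loss": 0.12}')
exp_metrics = json.loads('{"dice": 0.84, "valid_loss": 0.10}')

directions = {"dice": True, "valid_loss": False}
print(metrics_improved(main_metrics, exp_metrics, directions))
```

<p>A check like this is deliberately strict: a single metric that regresses is enough
to return <code>False</code>, which leaves the final call to a human reviewer.</p>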
<h4 id="2-workflow-for-deploying-to-the-development-environment" style="position:relative;">2. <a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/.github/workflows/2-develop.yaml" target="_blank" rel="nofollow noopener noreferrer">Workflow for deploying to the development environment</a><a href="#2-workflow-for-deploying-to-the-development-environment" aria-label="2 workflow for deploying to the development environment permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h4>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 400px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/caff74247266358cdd3fb30f73a47aac/39600/workflow_dev.png" alt="Workflow for deploying to the development environment" title="Workflow for deploying to the development environment" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Workflow for deploying to the development environment</em> Once we are happy with
our model's performance on the experiment branch, we can merge it into the
development branch. This would trigger a different CI/CD job that will:</p>
<p>a) retrain the model if the <code>dev</code> branch contains changes not present in the
<code>exp</code> branch. DVC will skip this stage if that's not the case. This step looks
almost identical to step (1.c) of the workflow above (rerunning the pipeline and
reporting metrics on GitHub).</p>
<p>b) deploy the web REST API application (that incorporates the new model) to a
development endpoint on Heroku:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">deploy-dev-api</span><span class="token punctuation">:</span>
  <span class="token key atrule">needs</span><span class="token punctuation">:</span> train<span class="token punctuation">-</span>and<span class="token punctuation">-</span>push
  <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> ubuntu<span class="token punctuation">-</span>latest
  <span class="token key atrule">steps</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
    <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/download<span class="token punctuation">-</span>artifact@master
      <span class="token key atrule">with</span><span class="token punctuation">:</span>
        <span class="token key atrule">name</span><span class="token punctuation">:</span> model_pickle
        <span class="token key atrule">path</span><span class="token punctuation">:</span> models
    <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> akhileshns/heroku<span class="token punctuation">-</span>[email protected]
      <span class="token key atrule">with</span><span class="token punctuation">:</span>
        <span class="token key atrule">heroku_api_key</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span>secrets.HEROKU_API_KEY<span class="token punctuation">}</span><span class="token punctuation">}</span>
        <span class="token key atrule">heroku_app_name</span><span class="token punctuation">:</span> demo<span class="token punctuation">-</span>api<span class="token punctuation">-</span>mag<span class="token punctuation">-</span>tiles<span class="token punctuation">-</span>dev
        <span class="token key atrule">heroku_email</span><span class="token punctuation">:</span> <span class="token string">'[email protected]'</span>
        <span class="token key atrule">team</span><span class="token punctuation">:</span> iterative<span class="token punctuation">-</span>sandbox
        <span class="token key atrule">usedocker</span><span class="token punctuation">:</span> <span class="token boolean important">true</span></code></pre></div>
<p>The development endpoint is now accessible at</p>
<p><a href="https://demo-api-mag-tiles-dev.herokuapp.com/analyze" target="_blank" rel="nofollow noopener noreferrer">https://demo-api-mag-tiles-dev.herokuapp.com/analyze</a> (note <code>-dev</code>),</p>
<p>and we can use it to assess the end-to-end performance of the overall solution.
If we pick a random test image <code>exp3_num_258558.jpg</code>,
<span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 252px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/5be68c36cd3947c4aa3c4c042639eece/88c24/exp3_num_258558.jpg" alt="Test image exp3_num_258558.jpg" title="Test image exp3_num_258558.jpg" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Test image <code>exp3_num_258558.jpg</code></em></p>
<p>we can send it to the endpoint using the <code>curl</code> command like this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">curl</span> <span class="token parameter variable">-F</span> <span class="token string">'image=@data/MAGNETIC_TILE_SURFACE_DEFECTS/test_images/exp3_num_258558.jpg'</span> <span class="token punctuation">\</span>
<span class="token parameter variable">-v</span> https://demo-api-mag-tiles-dev.herokuapp.com/analyze</span></code></pre></div>
<p>This will return some HTTP header information and the body of the response containing
the defect segmentation mask (<code>0</code> for pixel locations without defects and <code>1</code>
otherwise):</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">* Trying 18.208.60.216:443...
* Connected to demo-api-mag-tiles-dev.herokuapp.com (18.208.60.216) port 443 (#0)
...
{"pred":[[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,</code></pre></div>
<p>Alternatively, we can do a similar thing with a Python script that also saves
the output mask into an <code>exp3_num_258558_mask.png</code> image:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> json
<span class="token keyword">from</span> pathlib <span class="token keyword">import</span> Path
<span class="token keyword">import</span> matplotlib<span class="token punctuation">.</span>cm <span class="token keyword">as</span> cm
<span class="token keyword">import</span> matplotlib<span class="token punctuation">.</span>pyplot <span class="token keyword">as</span> plt
<span class="token keyword">import</span> numpy <span class="token keyword">as</span> np
<span class="token keyword">import</span> requests
url <span class="token operator">=</span> <span class="token string">'https://demo-api-mag-tiles-dev.herokuapp.com/analyze'</span>
file_path <span class="token operator">=</span> Path<span class="token punctuation">(</span>
<span class="token string">'data/MAGNETIC_TILE_SURFACE_DEFECTS/test_images/exp3_num_258558.jpg'</span><span class="token punctuation">)</span>
files <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token string">'image'</span><span class="token punctuation">:</span> <span class="token punctuation">(</span><span class="token builtin">str</span><span class="token punctuation">(</span>file_path<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token builtin">open</span><span class="token punctuation">(</span>file_path<span class="token punctuation">,</span> <span class="token string">'rb'</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">"image/jpeg"</span><span class="token punctuation">)</span><span class="token punctuation">}</span>
response <span class="token operator">=</span> requests<span class="token punctuation">.</span>post<span class="token punctuation">(</span>url<span class="token punctuation">,</span> files<span class="token operator">=</span>files<span class="token punctuation">)</span>
data <span class="token operator">=</span> json<span class="token punctuation">.</span>loads<span class="token punctuation">(</span>response<span class="token punctuation">.</span>content<span class="token punctuation">)</span>
pred <span class="token operator">=</span> np<span class="token punctuation">.</span>array<span class="token punctuation">(</span>data<span class="token punctuation">[</span><span class="token string">'pred'</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
plt<span class="token punctuation">.</span>imsave<span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f'</span><span class="token interpolation"><span class="token punctuation">{</span>file_path<span class="token punctuation">.</span>stem<span class="token punctuation">}</span></span><span class="token string">_mask.png'</span></span><span class="token punctuation">,</span> pred<span class="token punctuation">,</span> cmap<span class="token operator">=</span>cm<span class="token punctuation">.</span>gray<span class="token punctuation">)</span></code></pre></div>
<p>Below you can see what this mask looks like.
<span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 252px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/5df849e0bdc1e259cd2f1286c1636d68/019e0/exp3_num_258558_mask.png" alt="Output mask exp3_num_258558_mask.png" title="Output mask exp3_num_258558_mask.png" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Output mask <code>exp3_num_258558_mask.png</code></em></p>
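<p>Besides looking at the mask, a single number can be handy for a quick sanity
check. The sketch below computes the share of pixels flagged as defects from a
response body with the same <code>pred</code> structure as above. The helper
<code>defect_fraction</code> and the tiny hand-made response are illustrative
assumptions, not part of the project's code.</p>

```python
import json


def defect_fraction(pred):
    """Return the share of pixels labelled 1 (defect) in a 2-D 0/1 mask."""
    total = sum(len(row) for row in pred)
    if total == 0:
        # An empty mask has no defect pixels by definition.
        return 0.0
    defects = sum(sum(row) for row in pred)
    return defects / total


# A tiny hand-made body in the same shape as the API response.
body = '{"pred": [[0, 0, 1, 1], [0, 1, 1, 0]]}'
mask = json.loads(body)["pred"]

print(f"{defect_fraction(mask):.1%} of pixels flagged as defect")
```

<p>The same function works on the <code>data['pred']</code> list from the script above,
before it is converted to a NumPy array.</p>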
<p>Before we merge the dev branch into the main branch, we need to thoroughly
test and monitor the application in the development environment. A good test
could be duplicating real image requests to the dev endpoint for some period of
time and assessing the quality of the returned segmentation masks.</p>
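<p>One possible way to quantify mask quality, assuming reference masks are available
for at least some of the duplicated requests, is intersection over union (IoU).
This metric is our choice for illustration here; the post itself does not
prescribe one. The function name <code>mask_iou</code> and the sample masks are
hypothetical.</p>

```python
def mask_iou(pred, truth):
    """Intersection over union of two equally sized 2-D 0/1 masks."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly, so treat that case as a full match.
    return 1.0 if union == 0 else intersection / union


# Hand-made masks: two defect pixels overlap, two are flagged by only one side.
predicted = [[0, 1, 1], [0, 1, 0]]
reference = [[0, 1, 0], [0, 1, 1]]

print(mask_iou(predicted, reference))
```

<p>Tracking the average IoU over the test period gives a simple signal for whether
the new model is ready to be promoted.</p>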
<h4 id="3-workflow-for-deploying-to-the-production-environment" style="position:relative;">3. <a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/.github/workflows/3-deploy.yaml" target="_blank" rel="nofollow noopener noreferrer">Workflow for deploying to the production environment</a><a href="#3-workflow-for-deploying-to-the-production-environment" aria-label="3 workflow for deploying to the production environment permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h4>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 400px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/37c9afef843976da2bd2d8d6cf1c744a/39600/workflow_prod.png" alt="Workflow for deploying to the production environment" title="Workflow for deploying to the production environment" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Workflow for deploying to the production environment</em></p>
<p>If there are no issues and we are confident in the quality of the new model, we
can merge the development branch into the main branch of our repository. Again,
this triggers the third CI/CD workflow that deploys the code from the main
branch to the production API. This looks identical to the deployment into the
development environment, except now the deployment endpoint will be</p>
<p><a href="https://demo-api-mag-tiles-prod.herokuapp.com/analyze" target="_blank" rel="nofollow noopener noreferrer">https://demo-api-mag-tiles-prod.herokuapp.com/analyze</a> (note <code>-prod</code>).</p>
<h2 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In this series of posts (see <a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines">Part 1</a> and <a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-2-local-experiments">Part 2</a>), we
described how we addressed the problem of building a Computer Vision Web API for
defect detection. We’ve chosen this approach because it addresses the common
challenges that are shared across many CV projects: how to version datasets that
consist of a large number of small- to medium-sized files; how to avoid
triggering long-running stages of an ML pipeline when it’s not needed for
reproducibility; how to run model training jobs on the cloud infrastructure
without having to provision and manage everything yourself; and, finally, how to
track progress in key metrics when you run many ML experiments.</p>
<p>We've talked about the following:</p>
<ul>
<li>Common difficulties when building a Computer Vision Web API for defect detection
(<a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines#introduction">link</a>)</li>
<li>Pros and cons of exploratory work in Jupyter Notebooks
(<a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines#proof-of-concept-in-jupyter-notebooks">link</a>)</li>
<li>Versioning data in remote storage with DVC
(<a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines#data-versioning">link</a>)</li>
<li>Moving and refactoring the code from Jupyter Notebooks into DVC pipeline
stages
(<a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines#refactoring-jupyter-code-into-an-ml-pipeline">link</a>)</li>
<li>Experiment management and versioning
(<a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-2-local-experiments#experiment-management">link</a>)</li>
<li>Visualization of experiments and collaboration in Iterative Studio
(<a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-2-local-experiments#collaboration-and-reporting-with-iterative-studio">link</a>)</li>
<li>Remote experiments, CI/CD, and production deployment (this post)</li>
</ul>
<h2 id="what-to-try-next" style="position:relative;">What to Try Next<a href="#what-to-try-next" aria-label="what to try next permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Missed the previous parts of this post? See <a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines">Part 1: Data Versioning and ML
Pipelines</a> and <a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-2-local-experiments">Part 2: Local Experiments</a>.</p>
<ul>
<li>Reproduce this solution by setting your own configs, tokens, and access keys
for GitHub, AWS, and Heroku</li>
<li>Add a check to merge PRs automatically if the metrics have improved</li>
<li>Add a few simple unit tests and insert them into CML workflow files so they
run before reproducing the pipeline</li>
<li>Apply this approach to a different Computer Vision problem using a different
dataset or different problem type (image classification, object detection,
etc.)</li>
</ul>https://dvc.org/blog/CML-runners-saving-models-2https://dvc.org/blog/CML-runners-saving-models-2Fri, 06 May 2022 00:00:00 GMT<p>In <a href="https://dvc.org/blog/CML-runners-saving-models-1" target="_blank" rel="nofollow noopener noreferrer">part 1 of this guide</a> we
showed how to use CML to provision an AWS EC2 instance, train a model on it, and
save the model to our Git repository. Doing so allowed us to terminate the
training instance without losing the model.</p>
<p>This worked perfectly fine for the simple model we trained, but this approach is
not optimal when dealing with larger models. GitHub starts warning you at 50MB
files and simply
<a href="https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github" target="_blank" rel="nofollow noopener noreferrer">won't upload anything over 100MB</a>.
<a href="https://docs.gitlab.com/ee/user/gitlab_com/index.html#account-and-limit-settings" target="_blank" rel="nofollow noopener noreferrer">GitLab similarly limits</a>
the size of files you can store in your repository. A beefy XGBoost model can
easily exceed 100MB and a neural network can go up into the gigabytes.</p>
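<p>To see where a given model file falls relative to these limits, a small check like
the one below can be useful. It is only a sketch: the function name
<code>git_host_verdict</code> is hypothetical, the thresholds are the 50MB and 100MB
figures quoted above, and we assume they are counted in units of 1024 × 1024 bytes.</p>

```python
# Thresholds taken from the text above; the byte unit is an assumption.
WARN_BYTES = 50 * 1024 * 1024
BLOCK_BYTES = 100 * 1024 * 1024


def git_host_verdict(size_bytes):
    """Classify a file size against the warning and rejection thresholds."""
    if size_bytes > BLOCK_BYTES:
        return "rejected"
    if size_bytes > WARN_BYTES:
        return "warning"
    return "ok"


# Hypothetical model sizes: a small model, a mid-sized one, and a large one.
for size in (5 * 1024 * 1024, 70 * 1024 * 1024, 2 * 1024 * 1024 * 1024):
    print(size, git_host_verdict(size))
```

<p>In practice you would pass in <code>Path("model.pkl").stat().st_size</code> or
similar for the model file you are about to commit.</p>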
<p>That means we cannot save these models directly to our repository. Luckily we
can look towards another one of Iterative's open-source tools:
<a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a>. DVC includes a lot of features for managing machine
learning projects, such as ML pipelines, experiment tracking, and data
versioning. In this guide we will zoom in on just one of those features: remote
storage.</p>
<p>We can use DVC to save our model to a remote storage location, such as S3, HDFS,
an SFTP server, or even Google Drive. Much like Git tracks changes to your code,
DVC tracks changes to your data. It puts a reference to a specific version of
your data in the Git commit. That way your code is linked to a specific version
of your model, without containing the actual model.</p>
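<p>The idea of committing a small reference instead of the model itself can be
illustrated with a content hash. The sketch below is a simplified stand-in, not
DVC's actual implementation: the <code>make_pointer</code> helper, the field names,
and the file contents are assumptions made for illustration.</p>

```python
import hashlib
import json
import tempfile
from pathlib import Path


def make_pointer(path):
    """Describe a file by its content hash and size instead of its bytes."""
    data = Path(path).read_bytes()  # fine for a sketch; large files would be read in chunks
    return {
        "path": Path(path).name,
        "md5": hashlib.md5(data).hexdigest(),
        "size": len(data),
    }


with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "model.pkl"

    # Two "versions" of the same file produce two different pointers.
    model.write_bytes(b"weights v1")
    first = make_pointer(model)
    model.write_bytes(b"weights v2")
    second = make_pointer(model)

print(json.dumps(first, indent=2))
print("pointer changed:", first["md5"] != second["md5"])
```

<p>The pointer is only a few lines of text, so it fits comfortably in a Git commit
no matter how large the model is, while the hash identifies exactly which version
of the model the commit refers to.</p>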
<p>In this part 2, we will show you how to save the model we trained in part 1 to a
DVC remote. At the end of this guide our CML workflow will be doing the following
on a daily basis:</p>
<ol>
<li>Provision an Amazon Web Services (AWS) EC2 instance</li>
<li>Train the model</li>
<li>Save the model to a DVC remote storage on Google Drive</li>
<li>Save the model metrics to a GitHub repository</li>
<li>Create a merge request with the new outputs</li>
<li>Terminate the AWS EC2 instance</li>
</ol>
<p>All files needed for this guide can be found in
<a href="https://github.com/iterative/example_model_export_cml" target="_blank" rel="nofollow noopener noreferrer">this repository</a>.</p>
<admon type="tip">
<p>We will be using Google Drive as our remote storage. With slight modifications,
however, you can also use other remotes such as AWS S3, GCP Cloud Storage, and
Azure Storage. Please
<a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">refer to the DVC Docs</a>
for more details.</p>
</admon>
<h1 id="prerequisites" style="position:relative;">Prerequisites<a href="#prerequisites" aria-label="prerequisites permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Make sure to have followed
<a href="https://dvc.org/blog/CML-runners-saving-models-1" target="_blank" rel="nofollow noopener noreferrer">part 1 of this guide</a> and
have gotten CML up and running. The necessary files for all of this can be found
in <a href="https://github.com/iterative/example_model_export_cml" target="_blank" rel="nofollow noopener noreferrer">this repository</a>.
Additionally, set up the following things beforehand:</p>
<ul>
<li><a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">Install DVC</a></li>
<li><a href="https://dvc.org/doc/user-guide/setup-google-drive-remote#using-a-custom-google-cloud-project-recommended" target="_blank" rel="nofollow noopener noreferrer">Set up a GCP project</a></li>
<li><a href="https://console.cloud.google.com/apis/library/drive.googleapis.com" target="_blank" rel="nofollow noopener noreferrer">Enable the Google Drive API for your GCP project</a></li>
<li><a href="https://dvc.org/doc/user-guide/setup-google-drive-remote#using-service-accounts" target="_blank" rel="nofollow noopener noreferrer">Create a GCP service account and download the private key to a safe location</a></li>
<li><a href="https://support.google.com/drive/answer/2375091?hl=en&co=GENIE.Platform%3DDesktop" target="_blank" rel="nofollow noopener noreferrer">Create a Google Drive directory to save your model to</a></li>
<li><a href="https://support.google.com/drive/answer/7166529?hl=en&co=GENIE.Platform%3DDesktop" target="_blank" rel="nofollow noopener noreferrer">Grant the service account editor permissions to the Drive directory by sharing it with the service account's email address</a></li>
</ul>
<h1 id="setting-up-our-dvc-remote" style="position:relative;">Setting up our DVC remote<a href="#setting-up-our-dvc-remote" aria-label="setting up our dvc remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>When first using DVC in a project, you need to initialize DVC by running
<a href="https://dvc.org/doc/command-reference/init"><code>dvc init</code></a>. This creates the structure DVC uses to keep track of versioning
and ensures that Git does not track the files in the DVC repository. Instead,
Git will track a list of references to those files. Make sure to
commit the initialization to Git.</p>
<p>Then, in order to start using DVC for versioning, we need to set up a remote.
This is where our model files will end up, while DVC keeps track of their
respective versions. Here we will be using Google Drive as our remote.</p>
<p><a href="https://dvc.org/doc/user-guide/setup-google-drive-remote#setup-a-google-drive-dvc-remote" target="_blank" rel="nofollow noopener noreferrer">The DVC user guide</a>
explains how to set up a remote on Google Drive. If you would rather use another
remote, you can
<a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">find instructions here</a>.
In that case make sure to also update the DVC dependency in <code>requirements.txt</code>.</p>
<p>While DVC doesn't require a service account to work, we do need one in the
set-up we're aiming for. That's because without a service account we will need
to authorize ourselves through a log-in page every time. Our self-hosted runner
would get stuck on this page because we cannot authorize ourselves there.</p>
<p>In order to let DVC access the Google Drive folder we created from our runner,
we need to add two more GitHub Actions secrets: <code>GDRIVE_CREDENTIALS_DATA</code> and
<code>GOOGLE_DRIVE_URI</code>. The first one should contain the private key we downloaded
when setting up our service account (i.e. the <code>.json</code> file). The second one
should be the <a href="https://cloud.google.com/bigquery/external-data-drive" target="_blank" rel="nofollow noopener noreferrer">Drive URI</a>
to the directory we created in Google Drive (i.e. the sequence of random
characters at the end of our Google Drive URL).</p>
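<p>As a minimal sketch of that last step, the folder ID can be pulled out of a
Drive folder URL with a few lines of Python. The helper names
<code>drive_folder_id</code> and <code>dvc_remote_url</code> and the example URL are hypothetical,
and the code assumes the usual <code>https://drive.google.com/drive/folders/&lt;id&gt;</code>
URL shape.</p>

```python
from urllib.parse import urlparse


def drive_folder_id(url: str) -> str:
    """Return the last path segment of a Google Drive folder URL.

    Assumes the URL looks like https://drive.google.com/drive/folders/<id>,
    optionally followed by a query string such as ?usp=sharing.
    """
    path = urlparse(url).path.rstrip("/")
    folder_id = path.rsplit("/", 1)[-1]
    if not folder_id:
        raise ValueError(f"no folder ID found in {url!r}")
    return folder_id


def dvc_remote_url(url: str) -> str:
    """Build a gdrive:// URL of the form used with `dvc remote add`."""
    return f"gdrive://{drive_folder_id(url)}"


if __name__ == "__main__":
    # Hypothetical folder URL, used only for illustration.
    example = "https://drive.google.com/drive/folders/0AbCdEfGhIjKlMnOp?usp=sharing"
    print(dvc_remote_url(example))
```

<p>Only the ID itself goes into the <code>GOOGLE_DRIVE_URI</code> secret, since the
workflow adds the <code>gdrive://</code> prefix on its own.</p>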
<h1 id="export-the-model-to-a-dvc-remote" style="position:relative;">Export the model to a DVC remote<a href="#export-the-model-to-a-dvc-remote" aria-label="export the model to a dvc remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Now that we have set up the remote and made sure GitHub Actions has all the
details needed to access the remote, we can use the workflow below. In this
scenario, we train the model in the same way as in part 1, but we push it to the
DVC remote. A reference to the location of this file is added to the GitHub
repository (<code>model/random_forest.joblib.dvc</code>). The model itself is added to
<code>.gitignore</code> and not pushed to the repository.</p>
<p>The other files created in <code>train.py</code> are still pushed to an experiment branch
in GitHub. Afterwards a pull request is created.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">name</span><span class="token punctuation">:</span> CML with DVC
<span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token comment"># Here we use two triggers: manually and daily at 08:00</span>
  <span class="token key atrule">workflow_dispatch</span><span class="token punctuation">:</span>
  <span class="token key atrule">schedule</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token key atrule">cron</span><span class="token punctuation">:</span> <span class="token string">'0 8 * * *'</span>
<span class="token key atrule">jobs</span><span class="token punctuation">:</span>
  <span class="token key atrule">deploy-runner</span><span class="token punctuation">:</span>
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> ubuntu<span class="token punctuation">-</span>latest
    <span class="token key atrule">steps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Deploy runner on EC2
        <span class="token key atrule">env</span><span class="token punctuation">:</span>
          <span class="token key atrule">REPO_TOKEN</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span>
          <span class="token key atrule">AWS_ACCESS_KEY_ID</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_ACCESS_KEY_ID <span class="token punctuation">}</span><span class="token punctuation">}</span>
          <span class="token key atrule">AWS_SECRET_ACCESS_KEY</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_SECRET_ACCESS_KEY <span class="token punctuation">}</span><span class="token punctuation">}</span>
        <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
          cml runner \
          --cloud=aws \
          --cloud-region=eu-west \
          --cloud-type=t2.micro \
          --labels=cml-runner \
          --single</span>
  <span class="token key atrule">train-model</span><span class="token punctuation">:</span>
    <span class="token key atrule">needs</span><span class="token punctuation">:</span> deploy<span class="token punctuation">-</span>runner
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">-</span>runner<span class="token punctuation">]</span>
    <span class="token key atrule">timeout-minutes</span><span class="token punctuation">:</span> <span class="token number">120</span> <span class="token comment"># 2h</span>
    <span class="token key atrule">steps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>python@v2
        <span class="token key atrule">with</span><span class="token punctuation">:</span>
          <span class="token key atrule">python-version</span><span class="token punctuation">:</span> <span class="token string">'3.x'</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>node@v3
        <span class="token key atrule">with</span><span class="token punctuation">:</span>
          <span class="token key atrule">node-version</span><span class="token punctuation">:</span> <span class="token string">'16'</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Train model
        <span class="token key atrule">env</span><span class="token punctuation">:</span>
          <span class="token key atrule">REPO_TOKEN</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span>
          <span class="token key atrule">GDRIVE_CREDENTIALS_DATA</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.GDRIVE_CREDENTIALS_DATA <span class="token punctuation">}</span><span class="token punctuation">}</span>
        <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
          cml ci
          pip install -r requirements.txt</span>
          python get_data.py
          python train.py
          <span class="token comment"># Connect to your DVC remote storage and push the model to there</span>
          dvc add model/random_forest.joblib <span class="token comment"># This automatically adds the model to your .gitignore</span>
          dvc remote add <span class="token punctuation">-</span>d <span class="token punctuation">-</span>f myremote gdrive<span class="token punctuation">:</span>//$<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.GOOGLE_DRIVE_URI <span class="token punctuation">}</span><span class="token punctuation">}</span>
          dvc remote modify myremote gdrive_use_service_account true
          dvc push
          <span class="token comment"># Create pull request for the remaining files</span>
          cml pr .
          <span class="token comment"># Create CML report</span>
          cat model/metrics.txt <span class="token punctuation">></span> report.md
          cml publish model/confusion_matrix.png <span class="token punctuation">-</span><span class="token punctuation">-</span>md <span class="token punctuation">></span><span class="token punctuation">></span> report.md
          cml send<span class="token punctuation">-</span>comment <span class="token punctuation">-</span><span class="token punctuation">-</span>pr <span class="token punctuation">-</span><span class="token punctuation">-</span>update report.md</code></pre></div>
<p>And that's it! We have broadly the same set-up as outlined in part 1 of this
guide, but we no longer use our GitHub repository for storing our model.
Instead, the model is now saved to Google Drive, which allows for much larger
models.</p>
<admon type="tip">
<p>In a situation where we retrain our model daily based on the most recent data,
it would make sense to also use DVC to keep track of the data used in each
training. We could, for example, use our runner to import our training data from
a table in our database and write both the data and the model to the DVC remote.
This is beyond the scope of this guide, but
<a href="https://github.com/iterative/cml_dvc_case" target="_blank" rel="nofollow noopener noreferrer">here you can find a repository that covers this</a>.</p>
</admon>
<admon type="tip">
<p>While we have achieved our goal of using DVC for our model storage, there are
some other DVC features we could benefit from as well. We could define a
reproducible pipeline to replace our manual <code>get_data.py</code> and <code>train.py</code> tasks.
<a href="https://dvc.org/doc/start/data-pipelines" target="_blank" rel="nofollow noopener noreferrer">Here you can find</a> more information
on how to achieve this.</p>
</admon>
<h1 id="conclusions" style="position:relative;">Conclusions<a href="#conclusions" aria-label="conclusions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>As we saw in <a href="https://dvc.org/blog/CML-runners-saving-models-1">part 1 of this guide</a>, we can
use CML to automate periodic retraining of our models on a self-hosted
runner. We were able to save the model to our GitHub repository, but that
approach has its limitations with regard to model size.</p>
<p>In this part 2 we worked around those limitations by saving our model to a DVC
remote instead. We set up Google Drive as our remote and adapted our CML
workflow to save our models there. All in all, we can now automatically
(re)train models using a self-hosted runner, track different model versions in
Git, and save models to a remote storage such as Google Drive for future
reference.</p>
<p>A great extension of our CI/CD would be a <code>deploy</code> step to bring the latest
version of our model into production. This step might be conditional on the
performance of the model; we could decide to only start using it in production
if it performs better than previous iterations. All of this warrants a guide of
its own, however, so look out for that in the future! 😉</p>https://dvc.org/blog/end-to-end-computer-vision-api-part-2-local-experimentshttps://dvc.org/blog/end-to-end-computer-vision-api-part-2-local-experimentsThu, 05 May 2022 00:00:00 GMT<h3 id="introduction" style="position:relative;">Introduction<a href="#introduction" aria-label="introduction permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines" target="_blank" rel="nofollow noopener noreferrer">Earlier</a>,
we built a pipeline that produces a trained Computer Vision model. Now we need a
way to efficiently tune its configuration and the hyperparameters of the model.
We want the ability to:</p>
<ul>
<li>Run many experiments and easily compare their results to pick the
best-performing ones.</li>
<li>Track the global history of the model's performance, and map each improvement
to a particular change in code, configuration, or data.</li>
<li>Zoom into the details of each training run to help us diagnose issues.</li>
</ul>
<h3 id="experiment-management" style="position:relative;">Experiment Management<a href="#experiment-management" aria-label="experiment management permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Our DVC pipeline relies on the parameters defined in
the <a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/params.yaml" target="_blank" rel="nofollow noopener noreferrer"><code>params.yaml</code></a>
file in this case (see other possible file types
<a href="https://dvc.org/doc/command-reference/params#description" target="_blank" rel="nofollow noopener noreferrer">here</a>). By loading
its contents in each stage, we can avoid hard-coded parameters. It also allows
rerunning all or part of our pipeline under a different set of
parameters. The DVC pipeline YAML file
<a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/dvc.yaml" target="_blank" rel="nofollow noopener noreferrer"><code>dvc.yaml</code></a>
supports a
<a href="https://dvc.org/doc/user-guide/project-structure/pipelines-files#templating" target="_blank" rel="nofollow noopener noreferrer">templating format</a>
to insert values from different sources in the YAML structure itself.</p>
<p>DVC tracks which stages of the pipeline experienced changes and only reruns
those. By changes, we mean <em>everything</em> that might affect the predictive
performance of your model like changes to the dataset, source code and/or
parameters. This not only ensures complete reproducibility but often
significantly reduces the time needed to rerun the whole pipeline while ensuring
consistent results on every rerun. For example, at first, we started with a
pixel accuracy metric (the percentage of pixels in an image that are classified
correctly). Later, we realized that it might not be the best metric to track (as
described in
<a href="https://towardsdatascience.com/metrics-to-evaluate-your-semantic-segmentation-model-6bcb99639aa2" target="_blank" rel="nofollow noopener noreferrer">this blog post</a>),
and we decided to add the Dice coefficient to our metrics. There is no
reason for us to rerun the often time-consuming data preprocessing and model
training stages if we want to incorporate these updates. DVC pipelines can skip
the execution of these stages without our explicit instructions:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span>
</span>Running stage 'check_packages':
> pipenv run pip freeze > requirements.txt
Stage 'data_load' didn't change, skipping
Stage 'data_split' didn't change, skipping
Stage 'train' didn't change, skipping
Running stage 'evaluate':
> python src/stages/eval.py --config=params.yaml
...</code></pre></div>
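<p>The skipping behaviour above can be pictured as a comparison of content
hashes. The sketch below is a simplified illustration, not DVC's actual
implementation: the <code>stage_fingerprint</code> and <code>needs_rerun</code> helpers, the
in-memory <code>lock</code> dictionary, and the <code>src/stages/train.py</code> and
<code>data/train.csv</code> paths are hypothetical stand-ins for the state that DVC
records in <code>dvc.lock</code>.</p>

```python
import hashlib
import json


def stage_fingerprint(deps: dict[str, bytes], params: dict) -> str:
    """Hash a stage's dependency contents and parameters into one digest."""
    digest = hashlib.sha256()
    for name in sorted(deps):
        digest.update(name.encode())
        digest.update(hashlib.sha256(deps[name]).digest())
    digest.update(json.dumps(params, sort_keys=True).encode())
    return digest.hexdigest()


def needs_rerun(stage: str, fingerprint: str, lock: dict[str, str]) -> bool:
    """A stage reruns only if its fingerprint differs from the recorded one."""
    return lock.get(stage) != fingerprint


if __name__ == "__main__":
    # Hypothetical file contents, keyed by path.
    train_deps = {"src/stages/train.py": b"train code", "data/train.csv": b"rows"}
    eval_deps = {"src/stages/eval.py": b"pixel accuracy only"}
    params = {"train": {"learning_rate": 0.01}}

    # Record the state after a first full run.
    lock = {
        "train": stage_fingerprint(train_deps, params),
        "evaluate": stage_fingerprint(eval_deps, params),
    }

    # Only the evaluation code changes (e.g. a Dice metric is added).
    eval_deps["src/stages/eval.py"] = b"pixel accuracy and Dice"

    for stage, deps in [("train", train_deps), ("evaluate", eval_deps)]:
        changed = needs_rerun(stage, stage_fingerprint(deps, params), lock)
        print(stage, "-> rerun" if changed else "-> skip")
```

<p>Because the training stage's inputs hash to the same value as before, only
the evaluation stage is flagged for a rerun in this toy setup.</p>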
<p>DVC's
<a href="https://dvc.org/doc/user-guide/experiment-management" target="_blank" rel="nofollow noopener noreferrer">Experiment Management</a>
features make it easy to switch between reproducible experiments without
adding failed experiments to your Git history. Check out this
<a href="https://dvc.org/blog/ml-experiment-versioning" target="_blank" rel="nofollow noopener noreferrer">blog post</a>, which talks about
the idea of "ML Experiments as Code." That means treating experiments as you'd
treat code: using Git to track all changes in configs, metrics, and data
versions through text files. This approach removes the need for a separate
database/online service to store experiment metadata. If we wanted to run a few
experiments with learning rates of different orders of magnitude (e.g. <code>0.1</code>, <code>0.01</code>
and <code>0.001</code>), we'd do that as follows:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">train.learning_rate</span><span class="token operator">=</span><span class="token number">0.1</span>
</span>...
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">train.learning_rate</span><span class="token operator">=</span><span class="token number">0.01</span>
</span>...
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">train.learning_rate</span><span class="token operator">=</span><span class="token number">0.001</span>
</span>...</code></pre></div>
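<p>To make the dotted <code>train.learning_rate</code> notation concrete, here is a
minimal sketch of how such an override maps onto the nested structure of a
parameters file. The <code>set_param</code> helper and the sample parameter values are
hypothetical; the code only mimics the idea and does not reproduce DVC's own
override parsing.</p>

```python
import copy


def set_param(params: dict, dotted_key: str, value) -> dict:
    """Return a copy of params with the nested key named by dotted_key set to value."""
    updated = copy.deepcopy(params)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Walk down the nesting, creating intermediate sections if needed.
        node = node.setdefault(key, {})
    node[leaf] = value
    return updated


if __name__ == "__main__":
    # Hypothetical contents of params.yaml after parsing.
    base = {"train": {"learning_rate": 0.01, "batch_size": 16}}
    for lr in (0.1, 0.01, 0.001):
        print(set_param(base, "train.learning_rate", lr))
```

<p>Each call leaves the base parameters untouched, which mirrors how every
experiment starts from the same committed <code>params.yaml</code>.</p>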
<p>Optionally, you can delay the execution of the experiments by putting them in a
<a href="https://dvc.org/doc/user-guide/experiment-management/running-experiments#the-experiments-queue" target="_blank" rel="nofollow noopener noreferrer">queue</a>,
and execute them later with the <a href="https://dvc.org/doc/command-reference/exp/run#--run-all"><code>dvc exp run --run-all</code></a> command.</p>
<p>These local experiments are powered by Git references, and you can learn about
them in <a href="https://dvc.org/blog/experiment-refs" target="_blank" rel="nofollow noopener noreferrer">this post</a>. We can display all
experiments with the <a href="https://dvc.org/doc/command-reference/exp/show"><code>dvc exp show</code></a> command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--only-changed</span> <span class="token parameter variable">--sort-by</span><span class="token operator">=</span>dice_mean</span></code></pre></div>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable">──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows">  <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span>                <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Created<span class="token hide">**</span></span></span>        <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>train.loss<span class="token hide">**</span></span></span>   <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>valid.loss<span class="token hide">**</span></span></span>   <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>foreground.acc<span class="token hide">**</span></span></span>   <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>jaccard.coeff<span class="token hide">**</span></span></span>   <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>dice.multi<span class="token hide">**</span></span></span>   <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>dice_mean<span class="token hide">**</span></span></span>   <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc_mean<span class="token hide">**</span></span></span>   <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.learning_rate<span class="token hide">**</span></span></span>   <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.batch_size<span class="token hide">**</span></span></span>   <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>models<span class="token hide">**</span></span></span>
</span> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows">  workspace                 -              0.10356      0.069076     0.90321          0.75906         0.92371      0.70612     0.97689    0.01                  16                 5854528
  exp                       Apr 09, 2022   0.13305      0.087599     0.77803          0.66494         0.89084      0.70534     0.97891    0.01                  8                  6c513ae
  ├── 83a4975 [exp-2d80e]   Apr 09, 2022   0.11189      0.088695     0.86905          0.75296         0.92005      0.70612     0.97689    0.01                  16                 5854528
  ├── 675efb3 [exp-6c274]   Apr 09, 2022   0.10356      0.069076     0.90321          0.75906         0.92371      0.71492     0.98099    0.1                   16                 770745a
  └── c8b1857 [exp-04bcd]   Apr 09, 2022   0.11189      0.088695     0.86905          0.75296         0.92005      0.71619     0.98025    0.01                  8                  094c420
</span> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div>
<p>Once we identify the best one or few (e.g., those with the highest <code>dice_mean</code> score), we
can
<a href="https://dvc.org/doc/user-guide/experiment-management/persisting-experiments" target="_blank" rel="nofollow noopener noreferrer">persist</a>
them by creating a branch out of an experiment:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp branch</span> exp-04bcd my-branch
</span>Git branch 'my-branch' has been created from experiment 'exp-04bcd'.
To switch to the new branch run:
git checkout my-branch</code></pre></div>
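<p>Picking the winner can also be scripted. The sketch below hard-codes the
three experiment rows from the table above (name, <code>dice_mean</code>, <code>acc_mean</code>)
rather than reading them from DVC, and the <code>Experiment</code> type and
<code>best_experiment</code> helper are hypothetical.</p>

```python
from typing import NamedTuple


class Experiment(NamedTuple):
    name: str
    dice_mean: float
    acc_mean: float


def best_experiment(experiments: list[Experiment], metric: str = "dice_mean") -> Experiment:
    """Return the experiment with the highest value of the given metric."""
    return max(experiments, key=lambda exp: getattr(exp, metric))


# Values copied from the `dvc exp show` output above.
EXPERIMENTS = [
    Experiment("exp-2d80e", dice_mean=0.70612, acc_mean=0.97689),
    Experiment("exp-6c274", dice_mean=0.71492, acc_mean=0.98099),
    Experiment("exp-04bcd", dice_mean=0.71619, acc_mean=0.98025),
]

if __name__ == "__main__":
    print(best_experiment(EXPERIMENTS).name)
```

<p>With these values, ranking by <code>dice_mean</code> selects <code>exp-04bcd</code>, the
experiment that was turned into a branch, while ranking by <code>acc_mean</code> would
pick <code>exp-6c274</code> instead.</p>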
<p>To track detailed information about the training process, we integrated
<a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a> into the training code by
<a href="https://github.com/iterative/magnetic-tiles-defect/blob/41a057cf9b9a4a738087c8ad046b99c21f4faf17/src/utils/train_utils.py#L45" target="_blank" rel="nofollow noopener noreferrer">adding a callback object</a>
to the training function. DVCLive is a Python library for logging machine
learning metrics and other metadata in simple file formats, which is fully
compatible with DVC.</p>
<h2 id="collaboration-and-reporting-with-iterative-studio" style="position:relative;">Collaboration and Reporting with Iterative Studio<a href="#collaboration-and-reporting-with-iterative-studio" aria-label="collaboration and reporting with iterative studio permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>What if we needed to report the results to our team members or maybe hand over
the project to one of them? How do we communicate everything we did since the
conception of the project? What things resulted in the most significant
improvements? What things didn't seem to matter at all?</p>
<p><a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a> is a web-based application with
seamless integration with DVC for data and model management, experiment
tracking, visualization, and automation. It becomes especially valuable when
collaborating with others on the same project or when there's a need to
summarize the progress of the project through metrics and plots. All that's
needed is to connect the project's repository with Studio. Then Studio will
automatically parse all required information from <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, <code>params.yaml</code>, and
other text files that DVC recognizes. The result will be a repository view. The
view for our project is
<a href="https://studio.datachain.ai/user/alex000kim/views/magnetic-tiles-defect-5kozhnu9jo" target="_blank" rel="nofollow noopener noreferrer">here</a>.
It displays commits, metrics, parameters, the remote location of data and models
tracked by DVC, and more.</p>
<p>In the screenshot below, you can see that we created a separate <code>exp</code> branch
that displays the results of the local experiments that we decided to upload to
our remote repository, like trying different learning rates and batch sizes.
Note that earlier, we discarded all local experiments whose performance we
weren't satisfied with.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/113731a6cdf78d5f57dc0d416fcffd28/39600/studio_view.png" alt="Studio view" title="Studio view" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Below we can see the evolution of the key metrics and the value of the loss
function throughout training (enabled by the earlier integration of
<a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a>) for a set of selected commits.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9ca3705395f114f2c0d24507db503398/39600/dvc_live_studio.png" alt="DVCLive metrics displayed in Studio" title="DVCLive metrics displayed in Studio" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Now, for example, if we see that the loss function hasn't reached a plateau
after a certain number of epochs, we'll try increasing this number. Or, even
worse, if we see the loss function growing over time, it'll be an indication
that our learning rate may be too high. In this case, we may generate a few
additional experiments with lower learning rate values, eventually picking the
one that achieves good model performance after a reasonable number of training
epochs.</p>
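<p>That reasoning can be turned into a rough heuristic. The sketch below is
only an illustration: the <code>diagnose_loss</code> helper, the window size, and the
tolerance are assumptions, and the loss values are made up for the example
rather than taken from the project.</p>

```python
def diagnose_loss(losses: list[float], window: int = 3, tol: float = 0.01) -> str:
    """Classify the recent trend of a per-epoch loss history.

    Compares the mean of the last `window` epochs with the mean of the
    `window` epochs before them, using a tolerance relative to the older mean.
    """
    if len(losses) < 2 * window:
        return "not enough epochs"
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    change = recent - previous
    threshold = tol * abs(previous)
    if change > threshold:
        return "diverging: try a lower learning rate"
    if change < -threshold:
        return "still improving: try more epochs"
    return "plateau reached"


if __name__ == "__main__":
    # Made-up loss histories, one value per epoch.
    print(diagnose_loss([1.0, 0.9, 0.8, 0.7, 0.6, 0.5]))
    print(diagnose_loss([0.5, 0.5, 0.5, 0.5, 0.5, 0.5]))
    print(diagnose_loss([0.5, 0.6, 0.7, 0.8, 0.9, 1.0]))
```

<p>A check like this is no substitute for looking at the plots, but it shows
how the per-epoch values logged during training can drive the decision about
which experiment to queue next.</p>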
<h2 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In this post, we talked about the following:</p>
<ul>
<li>How to run and view ML experiments locally and commit the most promising ones
to the remote Git repository</li>
<li>How the integration of Iterative Studio with DVC enables collaboration,
traceability, and reporting on projects with multiple team members</li>
<li>How DVCLive allows us to peek into the training process and helps us decide
what ideas to try next</li>
</ul>
<p>What if we don't have a machine with a powerful GPU, and we'd like to take
advantage of our cloud infrastructure? What if we'd like to have a custom report
(with metrics, plots, and other visuals) accompany every commit/pull request on
GitHub? The third (and last) part of this series of posts will demonstrate how
another open-source tool from the Iterative ecosystem, <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML</a>,
addresses these issues.</p>https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelineshttps://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelinesTue, 03 May 2022 00:00:00 GMT<h3 id="introduction" style="position:relative;">Introduction<a href="#introduction" aria-label="introduction permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In this series of posts, we'll describe an approach that streamlines the
lifecycle stages of a typical Computer Vision project, from
proof-of-concept through configuration and parameter tuning to deployment
in the production environment.</p>
<p>Automatic defect detection is a common problem encountered in many industries,
especially manufacturing. A typical setup would include a conveyor belt that
moves some products along the production line and a camera installed above the
conveyor. The camera takes pictures of the products moving below and connects to
a computer that controls it. This computer needs to send raw images to some
defect detection service, receive information about the location and size of the
defects, if any, and may even control what happens to a defective product by
being connected to a robotic arm via a PLC (programmable logic controller).</p>
<p>As our demo project, we've selected a very common deployment pattern for this
setup: a CV model wrapped in a web API service. Specifically, we'll perform an
<a href="https://ai.stanford.edu/~syyeung/cvweb/tutorial3.html" target="_blank" rel="nofollow noopener noreferrer">image segmentation</a> task
on a magnetic tiles dataset first introduced in this
<a href="https://www.researchgate.net/profile/Congying-Qiu/publication/327701995_Saliency_defect_detection_of_magnetic_tiles/links/5b9fd1bd45851574f7d25019/Saliency-defect-detection-of-magnetic-tiles.pdf" target="_blank" rel="nofollow noopener noreferrer">paper</a>
and available in this GitHub
<a href="https://github.com/abin24/Magnetic-tile-defect-datasets." target="_blank" rel="nofollow noopener noreferrer">repository</a>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ccf0a3239a685a57ccdab7f42d52f25f/39600/dataset_sample.png" alt="A sample from the image segmentation dataset we used for this project. Top
row: images of magnetic tile surfaces. Bottom row: segmentation mask (white
pixels show defective areas)" title="A sample from the image segmentation dataset we used for this project. Top
row: images of magnetic tile surfaces. Bottom row: segmentation mask (white
pixels show defective areas)" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<ul>
<li>This post (part 1) introduces the concepts of data versioning and ML pipelines
as they apply to Computer Vision projects.</li>
<li>Part 2 will focus on experiment tracking and management - key components
needed for effective collaboration between team members.</li>
<li>In part 3, you’ll learn how to easily move your model training workloads from
a local machine to cloud infrastructure and set up proper CI/CD workflows for
ML projects.</li>
</ul>
<h3 id="target-audience" style="position:relative;">Target Audience<a href="#target-audience" aria-label="target audience permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This post is aimed at technical readers who are
familiar with general Computer Vision concepts, CI/CD processes, and cloud
infrastructure. Familiarity with the Iterative ecosystem of tools such as
<a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML</a>, and
<a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Studio</a> is not required but would help with
understanding the nuances of our solution.</p>
<h3 id="summary-of-the-solution" style="position:relative;">Summary of the Solution<a href="#summary-of-the-solution" aria-label="summary of the solution permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>All the code for the project is stored in this GitHub
<a href="https://github.com/iterative/magnetic-tiles-defect" target="_blank" rel="nofollow noopener noreferrer">repository</a>.</p>
<p>The CV API solution that we are proposing can be summarized in the following
steps:</p>
<ul>
<li>The client service submits an image to our API endpoint</li>
<li>The image is preprocessed to adhere to the specifications that our model
expects</li>
<li>The CV model ingests the processed image and outputs its prediction as an
image mask</li>
<li>Some postprocessing is applied to the image mask</li>
<li>The API replies to the client with the output mask</li>
</ul>
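<p>The steps above can be sketched as a chain of three functions. This is a toy illustration only: the names <code>preprocess</code>, <code>predict</code>, <code>postprocess</code>, and <code>handle_request</code> are hypothetical, images are plain nested lists of grayscale values, and <code>predict</code> is a stand-in for the real model, not the code from the repository.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">def preprocess(image, size=2):
    """Crop or zero-pad a grayscale image (rows of 0-255 ints) to size x size,
    then scale the pixel values to the range [0, 1]."""
    rows = [list(row[:size]) + [0] * (size - len(row[:size])) for row in image[:size]]
    rows += [[0] * size for _ in range(size - len(rows))]
    return [[pixel / 255 for pixel in row] for row in rows]


def predict(inputs):
    """Stand-in for the CV model: treats darker pixels as more likely defective."""
    return [[1.0 - value for value in row] for row in inputs]


def postprocess(probabilities, threshold=0.5):
    """Binarize the probabilities into a mask: 255 = defect, 0 = background."""
    return [[255 if p > threshold else 0 for p in row] for row in probabilities]


def handle_request(image):
    """What the API endpoint does with each submitted image."""
    return postprocess(predict(preprocess(image)))


print(handle_request([[255, 0, 9], [0, 255, 9]]))</code></pre></div>
<p>Keeping the three steps as separate functions makes it easy to swap the stub for a trained model later without touching the request handling.</p>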
<p>The repository also contains code for the web application itself, which can be
found in the
<a href="https://github.com/iterative/magnetic-tiles-defect/tree/main/app" target="_blank" rel="nofollow noopener noreferrer"><code>app</code></a>
directory. While the web application is very simple, its implementation is
beyond the scope of this blog post. In short, it's based on the
<a href="https://fastapi.tiangolo.com/" target="_blank" rel="nofollow noopener noreferrer"><code>FastAPI</code></a> library, and we deploy it to the
Heroku platform through a Docker container defined in this
<a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/Dockerfile" target="_blank" rel="nofollow noopener noreferrer"><code>Dockerfile</code></a>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 671px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b38f70c06e772513379790ada3051d4b/508aa/web_api_diagram.png" alt="Simplified diagram of the CV API
solution" title="Simplified diagram of the CV API
solution" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h3 id="prerequisites-for-reproduction" style="position:relative;">Prerequisites for Reproduction<a href="#prerequisites-for-reproduction" aria-label="prerequisites for reproduction permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Feel free to fork the
<a href="https://github.com/iterative/magnetic-tiles-defect" target="_blank" rel="nofollow noopener noreferrer">repository</a> if you'd like
to replicate our steps and deploy your own API service. Keep in mind that you'll
need to set up and configure the following:</p>
<ul>
<li>GitHub account and
<a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token" target="_blank" rel="nofollow noopener noreferrer">GitHub application token</a></li>
<li><a href="https://pipenv.pypa.io/en/latest/" target="_blank" rel="nofollow noopener noreferrer"><code>pipenv</code></a> installed locally</li>
<li>AWS account,
<a href="https://aws.amazon.com/premiumsupport/knowledge-center/create-access-key/" target="_blank" rel="nofollow noopener noreferrer">access keys</a>,
and an
<a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html" target="_blank" rel="nofollow noopener noreferrer">S3 bucket</a></li>
<li>Heroku account and
<a href="https://help.heroku.com/PBGP6IDE/how-should-i-generate-an-api-key-that-allows-me-to-use-the-heroku-platform-api" target="_blank" rel="nofollow noopener noreferrer">Heroku API key</a></li>
</ul>
<p>For security reasons, you'll need to set up all keys and tokens through
<a href="https://docs.github.com/en/actions/security-guides/encrypted-secrets" target="_blank" rel="nofollow noopener noreferrer">GitHub secrets</a>.
You'll also need to change the remote location (and its name) in the
<a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/.dvc/config" target="_blank" rel="nofollow noopener noreferrer">DVC config</a>
file for versioning data and other artifacts.</p>
<h3 id="proof-of-concept-in-jupyter-notebooks" style="position:relative;">Proof-of-Concept in Jupyter Notebooks<a href="#proof-of-concept-in-jupyter-notebooks" aria-label="proof of concept in jupyter notebooks permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>A typical ML project would start with data collection and/or labeling, but we
are skipping all this hard work because it was done for us by the researchers
who published the dataset.</p>
<p>We'll get right to the exciting part of training CV models in Jupyter notebooks,
which you can find
<a href="https://github.com/iterative/magnetic-tiles-defect/tree/main/notebooks" target="_blank" rel="nofollow noopener noreferrer">here</a>.
There are three notebooks:</p>
<ol>
<li><a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/notebooks/1_ProcessData.ipynb" target="_blank" rel="nofollow noopener noreferrer"><code>1_ProcessData.ipynb</code></a>
downloads, processes, and organizes the data for easy loading into the
training process later</li>
<li><a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/notebooks/2_TrainSegmentationModel.ipynb" target="_blank" rel="nofollow noopener noreferrer"><code>2_TrainSegmentationModel.ipynb</code></a>
uses the <a href="https://github.com/fastai/fastai" target="_blank" rel="nofollow noopener noreferrer"><code>fastai</code></a> deep learning framework to
train an image segmentation model</li>
<li><a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/notebooks/3_Evaluate.ipynb" target="_blank" rel="nofollow noopener noreferrer"><code>3_Evaluate.ipynb</code></a>
computes model performance on the test dataset</li>
</ol>
<p>Jupyter Notebook is by far the most popular tool for quick exploratory work when
it comes to data analysis and modeling. However, it's not without
<a href="https://www.youtube.com/watch?v=7jiPeIFXb6U" target="_blank" rel="nofollow noopener noreferrer">its limitations</a>. One of the
biggest issues with Jupyter is that it has no guardrails to ensure
reproducibility: variables and objects can carry hidden state, and cells can
be run out of order. While there are several projects that
attempt to alleviate some of these issues (notably,
<a href="https://github.com/stitchfix/nodebook" target="_blank" rel="nofollow noopener noreferrer"><code>nodebook</code></a>,
<a href="https://github.com/nteract/papermill" target="_blank" rel="nofollow noopener noreferrer"><code>papermill</code></a>,
<a href="https://github.com/jupyter/nbdime" target="_blank" rel="nofollow noopener noreferrer"><code>nbdime</code></a>,
<a href="https://github.com/computationalmodelling/nbval" target="_blank" rel="nofollow noopener noreferrer"><code>nbval</code></a>,
<a href="https://github.com/kynan/nbstripout" target="_blank" rel="nofollow noopener noreferrer"><code>nbstripout</code></a>, and
<a href="https://github.com/nbQA-dev/nbQA" target="_blank" rel="nofollow noopener noreferrer"><code>nbQA</code></a>), they don’t solve them completely.</p>
<p>That's where the concepts of data versioning and ML pipelines come in.</p>
<h3 id="data-versioning" style="position:relative;">Data Versioning<a href="#data-versioning" aria-label="data versioning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In most ML projects, training data changes gradually over time as new training
instances (images in our case) get added while older ones might be removed.
Simply creating snapshots of our training data at the time of training (e.g.
labeling data directories with dates) quickly becomes unsustainable since these
snapshots will contain many duplicates. Additionally, tracking which data
directory was used to train each model quickly becomes hard to manage, and
linking data versions and models to their respective code versions complicates
things even further.</p>
<p>A much better approach is to:</p>
<ol>
<li>
<p>track only the deltas between different versions of the datasets; and</p>
</li>
<li>
<p>store only reference links to the data in the project’s Git repository,
while the actual data lives in remote storage</p>
</li>
</ol>
<p>This is exactly what we can do with <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a> by running only a
couple of DVC commands. In turn, DVC handles all the underlying complexity of
managing data versions, performing file deduplication, pushing and pulling
to/from different remote storage solutions and more.</p>
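<p>To give an intuition for why this avoids duplicated snapshots, here is a toy content-addressed store. It is a simplified sketch under our own assumptions, not DVC's actual implementation: the <code>ContentStore</code> class and the file names are hypothetical, and SHA-256 is an arbitrary hash choice for the illustration.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import hashlib


class ContentStore:
    """Toy content-addressed store: identical file contents are kept only once."""

    def __init__(self):
        # content hash -> bytes (stands in for the remote storage)
        self.blobs = {}

    def add(self, files):
        """Store one dataset version and return small pointers (name -> hash),
        which is the only thing that would need to be committed to Git."""
        pointers = {}
        for name, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)  # no-op if content already stored
            pointers[name] = digest
        return pointers


store = ContentStore()
version_1 = store.add({"a.jpg": b"tile-a", "b.jpg": b"tile-b"})
version_2 = store.add({"a.jpg": b"tile-a", "b.jpg": b"tile-b", "c.jpg": b"tile-c"})
print(len(store.blobs))  # 3 stored files for 5 file references</code></pre></div>
<p>Two dataset versions reference five files in total, yet only three distinct contents are stored, and each version is fully described by its small pointer mapping.</p>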
<p>Check out
<a href="https://dvc.org/doc/use-cases/versioning-data-and-model-files/tutorial" target="_blank" rel="nofollow noopener noreferrer">this tutorial</a>
to learn more about data and model versioning with DVC.</p>
<p><img src="https://editor.analyticsvidhya.com../uploads/86351git-dvc.png" alt="Diagram of how DVC performs data versioning
"></p>
<p>In this project, AWS S3 is our remote storage configured in the
<a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/.dvc/config" target="_blank" rel="nofollow noopener noreferrer"><code>.dvc/config</code></a>
file. In other words, we store the images in an AWS bucket while only keeping
references to those files in our git repository.</p>
<h3 id="refactoring-jupyter-code-into-an-ml-pipeline" style="position:relative;">Refactoring Jupyter code into an ML pipeline<a href="#refactoring-jupyter-code-into-an-ml-pipeline" aria-label="refactoring jupyter code into an ml pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Another powerful set of DVC features is ML pipelines. An ML pipeline is a way to
codify and automate the workflow used to reproduce a machine learning model. A
pipeline consists of a sequence of stages.</p>
<p>First, we did some refactoring of our Jupyter code into individual and
self-contained modules:</p>
<ul>
<li><a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/src/stages/data_load.py" target="_blank" rel="nofollow noopener noreferrer"><code>data_load.py</code></a>
downloads raw data locally</li>
<li><a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/src/stages/data_split.py" target="_blank" rel="nofollow noopener noreferrer"><code>data_split.py</code></a>
splits data into train and test subsets</li>
<li><a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/src/stages/train.py" target="_blank" rel="nofollow noopener noreferrer"><code>train.py</code></a>
uses the <a href="https://github.com/fastai/fastai" target="_blank" rel="nofollow noopener noreferrer"><code>fastai</code></a> library to train a UNet
model with a ResNet-34 encoder and saves it into a pickle file</li>
<li><a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/src/stages/eval.py" target="_blank" rel="nofollow noopener noreferrer"><code>eval.py</code></a>
evaluates the model's performance on the test subset</li>
</ul>
<p>Specific execution commands, dependencies, and outputs of each stage are defined
in the pipeline file
<a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/dvc.yaml" target="_blank" rel="nofollow noopener noreferrer"><code>dvc.yaml</code></a>
(more about pipelines files
<a href="https://dvc.org/doc/user-guide/project-structure/pipelines-files" target="_blank" rel="nofollow noopener noreferrer">here</a>).</p>
<p>We've also added an optional
<a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/dvc.yaml#L2" target="_blank" rel="nofollow noopener noreferrer"><code>check_packages</code></a>
stage that freezes the environment into a <code>requirements.txt</code> file containing all
Python packages installed in the environment and their versions. We enabled the
<a href="https://dvc.org/doc/command-reference/run#--always-changed" target="_blank" rel="nofollow noopener noreferrer"><code>always_changed</code></a>
field in the configuration of this stage to ensure DVC reruns this stage every
time. All other stages have this text file as a dependency. Thus, the entire
pipeline will be rerun if anything about our Python environment changes.</p>
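<p>For illustration, a script that freezes the environment could look roughly like the sketch below. This is an assumption about how such a stage might work, built on the standard <code>importlib.metadata</code> module. The function names <code>freeze_environment</code> and <code>write_requirements</code> are hypothetical and not taken from the repository.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">from importlib.metadata import distributions


def freeze_environment():
    """Return sorted 'name==version' lines for every installed distribution."""
    lines = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that lack metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


def write_requirements(path="requirements.txt"):
    """Write the frozen environment to a file that other stages can depend on."""
    lines = freeze_environment()
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return lines</code></pre></div>
<p>Because the output is sorted, the file only changes when a package is added, removed, or changes version, which is what makes it a useful dependency for the other stages.</p>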
<p>We can see the whole dependency graph (directed acyclic graph, to be exact)
using the <a href="https://dvc.org/doc/command-reference/dag" target="_blank" rel="nofollow noopener noreferrer"><code>dvc dag</code></a> command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc dag</span>
</span> +----------------+
| check_packages |
*****+----------------+
***** * ** **
**** ** ** ***
*** ** ** ***
+-----------+ ** * ***
| data_load | ** * *
+-----------+ ** * *
*** ** * *
* ** * *
** * * *
+------------+ * *
| data_split |*** * *
+------------+ *** * *
* *** * *
* *** * *
* ** * *
** +-------+ ***
*** | train | ***
*** +-------+ ***
*** ** ***
*** ** ***
** ***
+----------+
| evaluate |
+----------+</code></pre></div>
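<p>The execution order implied by such a graph can be reproduced with a topological sort. The sketch below uses Python's standard <code>graphlib</code> module. The stage names come from our pipeline, while the exact edges are our reading of the diagram above and should be treated as an assumption.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
dependencies = {
    "check_packages": set(),
    "data_load": {"check_packages"},
    "data_split": {"check_packages", "data_load"},
    "train": {"check_packages", "data_split"},
    "evaluate": {"check_packages", "data_split", "train"},
}

# A stage is only listed after everything it depends on.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)</code></pre></div>
<p>With these edges there is only one valid order, starting at <code>check_packages</code> and ending at <code>evaluate</code>.</p>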
<p>The entire pipeline can be easily reproduced with the <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span>
</span>Running stage 'check_packages':
> python src/stages/check_pkgs.py --config=params.yaml
...
Running stage 'data_load':
> python src/stages/data_load.py --config=params.yaml
...
Running stage 'data_split':
> python src/stages/data_split.py --config=params.yaml
...
Running stage 'train':
> python src/stages/train.py --config=params.yaml
...
Running stage 'evaluate':
> python src/stages/eval.py --config=params.yaml
...</code></pre></div>
<h2 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In this first part of the blog post, we talked about the following:</p>
<ul>
<li>Common difficulties when building a Computer Vision web API for defect detection</li>
<li>Pros and cons of exploratory work in Jupyter Notebooks</li>
<li>Versioning data in remote storage with DVC</li>
<li>Moving and refactoring the code from Jupyter Notebooks into DVC pipeline
stages</li>
</ul>
<p>In the second part, we’ll see how to get the most out of experiment tracking and
management by seamlessly integrating <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">DVC</a>,
<a href="https://github.com/iterative/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a>, and
<a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a>.</p>https://dvc.org/blog/april-22-community-gemshttps://dvc.org/blog/april-22-community-gemsThu, 28 Apr 2022 00:00:00 GMT<h3 id="when-i-run-dvc-repro-on-a-stage-does-it-automatically-push-any-outputs-to-my-remote" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/953616587523498025" target="_blank" rel="nofollow noopener noreferrer">When I run <code>dvc repro</code> on a stage, does it automatically push any outputs to my remote?</a><a href="#when-i-run-dvc-repro-on-a-stage-does-it-automatically-push-any-outputs-to-my-remote" aria-label="when i run dvc repro on a stage does it automatically push any outputs to my remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Great question from @tina_rey!</p>
<p>The <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> command doesn't automatically push any outputs or data to your
remote. The outputs are stored in the cache until you run <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>, which then
pushes them from your cache to your remote.</p>
<h3 id="is-dvc-dag-based-on-deps-and-outs-so-that-a-stage-that-depends-on-the-output-of-another-stage-will-always-be-executed-after-the-former-has-finished" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/956113493155799070" target="_blank" rel="nofollow noopener noreferrer">Is <code>dvc dag</code> based on <code>deps</code> and <code>outs</code>, so that a stage that depends on the output of another stage will always be executed after the former has finished?</a><a href="#is-dvc-dag-based-on-deps-and-outs-so-that-a-stage-that-depends-on-the-output-of-another-stage-will-always-be-executed-after-the-former-has-finished" aria-label="is dvc dag based on deps and outs so that a stage that depends on the output of another stage will always be executed after the former has finished permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is a good question from @johnysku!</p>
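<p>That is correct! DVC builds the graph from the <code>deps</code> and <code>outs</code> of each stage, so a
stage that depends on another stage's output always runs after it. Stages or
pipelines that are independent of each other may run in any order, so without an
explicit dependency link, stages could be executed in an unexpected order.</p>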
<h3 id="if-i-want-to-use-the-foreach-utility-in-dvc-repro-is-there-a-way-i-can-use-glob-patterns-to-create-the-list-dvc-needs-to-iterate-over" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/956241424150577233" target="_blank" rel="nofollow noopener noreferrer">If I want to use the <code>foreach</code> utility in <code>dvc repro</code>, is there a way I can use glob patterns to create the list DVC needs to iterate over?</a><a href="#if-i-want-to-use-the-foreach-utility-in-dvc-repro-is-there-a-way-i-can-use-glob-patterns-to-create-the-list-dvc-needs-to-iterate-over" aria-label="if i want to use the foreach utility in dvc repro is there a way i can use glob patterns to create the list dvc needs to iterate over permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Another interesting question from @copah!</p>
<p>If you have <code>mystage</code> which uses <code>foreach</code>, you can run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> on <code>mystage</code>
to iterate over every <code>mystage</code> stage.</p>
<h3 id="how-does-dvc-handle-files-that-have-been-deleted-from-remote-storage" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/956254582676258866" target="_blank" rel="nofollow noopener noreferrer">How does DVC handle files that have been deleted from remote storage?</a><a href="#how-does-dvc-handle-files-that-have-been-deleted-from-remote-storage" aria-label="how does dvc handle files that have been deleted from remote storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Really good question from @Meme Philosopher!</p>
<p>DVC will fail when you try to pull files that have been deleted from the remote
and notify you that those files are missing in remote storage.</p>
<h3 id="can-i-separate-cml-running-from-github-actions-vm-to-work-with-gcp-or-aws-so-training-and-testing-are-in-these-cloud-environments" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/954316332457947169" target="_blank" rel="nofollow noopener noreferrer">Can I separate CML running from GitHub actions VM to work with GCP or AWS so training and testing are in these cloud environments?</a><a href="#can-i-separate-cml-running-from-github-actions-vm-to-work-with-gcp-or-aws-so-training-and-testing-are-in-these-cloud-environments" aria-label="can i separate cml running from github actions vm to work with gcp or aws so training and testing are in these cloud environments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for the question @Atsu!</p>
<p>This is supported out-of-the-box! Here's how it works:</p>
<ol>
<li>Within GitHub Actions, CML launches a
<a href="https://cml.dev/doc/self-hosted-runners" target="_blank" rel="nofollow noopener noreferrer">self-hosted runner</a> on GCP or AWS
using <code>cml runner --labels=cml --cloud=gcp</code>/<code>--cloud=aws</code></li>
<li>GitHub Actions runs the rest of the workflow on the self-hosted runner using
<code>runs-on: [self-hosted, cml]</code> and the maximum allowable
<code>timeout-minutes: 4320</code></li>
<li>If GitHub Actions is about to time out, CML will restart the workflow, so make
sure your code regularly caches and restores data if it's expected to take >3
days to run.</li>
</ol>
<p>You can follow along with
<a href="https://cml.dev/doc/self-hosted-runners?tab=GitHub#allocating-cloud-compute-resources-with-cml" target="_blank" rel="nofollow noopener noreferrer">this doc</a>
to get started.</p>
<p>The key is requesting GitHub's
<a href="https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners#usage-limits" target="_blank" rel="nofollow noopener noreferrer">maximum <code>timeout-minutes: 4320</code></a>.
This signals to CML to
<a href="https://cml.dev/doc/ref/runner#faqs-and-known-issues" target="_blank" rel="nofollow noopener noreferrer">restart the workflow</a>
just before the timeout. You'll also have to write your code to cache results so
that the restarted workflow will use previous results (e.g. use
<a href="https://dvc.org/doc/user-guide/experiment-management/checkpoints#caching-checkpoints" target="_blank" rel="nofollow noopener noreferrer">https://dvc.org/doc/user-guide/experiment-management/checkpoints#caching-checkpoints</a>
and <a href="https://github.com/iterative/dvc/issues/6823" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/dvc/issues/6823</a>)</p>
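<p>As a minimal sketch of the "cache and restore" idea, the loop below saves its progress after every step and resumes from the saved state when it is started again. The function name <code>run_with_checkpoints</code>, the JSON state file, and the <code>max_steps_per_run</code> parameter (used here only to simulate a run being cut off) are hypothetical. Real training code would save model checkpoints instead.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import json
import os


def run_with_checkpoints(total_steps, state_path, max_steps_per_run=None):
    """Resume from state_path if it exists and save progress after every step."""
    state = {"step": 0, "total": 0}
    if os.path.exists(state_path):
        with open(state_path) as handle:
            state = json.load(handle)

    for count, step in enumerate(range(state["step"], total_steps)):
        if count == max_steps_per_run:
            break  # simulates the workflow being stopped at the timeout
        state["total"] += step  # stand-in for one unit of real work
        state["step"] = step + 1
        with open(state_path, "w") as handle:
            json.dump(state, handle)
    return state</code></pre></div>
<p>Calling the function a second time with the same <code>state_path</code> continues where the first call stopped, which is the behaviour a restarted workflow needs.</p>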
<h3 id="when-running-an-experiment-from-the-web-interface-with-dvc-is-there-any-way-to-get-the-new-metrics-to-show-on-the-commit-created-by-iterative-studio-for-the-experiment" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/841856466897469441/957931058639306772" target="_blank" rel="nofollow noopener noreferrer">When running an experiment from the web interface with DVC, is there any way to get the new metrics to show on the commit created by Iterative Studio for the experiment?</a><a href="#when-running-an-experiment-from-the-web-interface-with-dvc-is-there-any-way-to-get-the-new-metrics-to-show-on-the-commit-created-by-iterative-studio-for-the-experiment" aria-label="when running an experiment from the web interface with dvc is there any way to get the new metrics to show on the commit created by iterative studio for the experiment permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Awesome question about Studio from @Benjamin-Etheredge!</p>
<p>In order to show the experiment results in Studio, you would have to commit and
push the results as part of your CI (continuous integration) action. Here's an
<a href="https://github.com/iterative/demo-fashion-mnist/blob/main/.github/workflows/cml.yaml" target="_blank" rel="nofollow noopener noreferrer">example GitHub action script</a>
that does this.</p>
<p>We do understand that it is not ideal that there are 2 commits, one with your
changes and one with the results. We have been thinking about how this can be
improved and it would be great to hear if you have
<a href="https://github.com/iterative/studio-support/" target="_blank" rel="nofollow noopener noreferrer">any thoughts/ideas</a>!</p>
<h3 id="is-there-a-way-to-get-dvc-to-import-from-a-private-repository" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/964204106824695868" target="_blank" rel="nofollow noopener noreferrer">Is there a way to get DVC to import from a private repository?</a><a href="#is-there-a-way-to-get-dvc-to-import-from-a-private-repository" aria-label="is there a way to get dvc to import from a private repository permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Good question from @qubvel!</p>
<p>You can use SSH to handle this and run the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc import</span> [email protected]:<span class="token operator"><</span>repository location<span class="token operator">></span> <span class="token operator"><</span>data_path<span class="token operator">></span></span></code></pre></div>
<h3 id="if-i-use-a-local-remote-and-a-shared-cache-will-the-data-be-symlinked-from-the-remote-to-the-cache" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/963768504987815987" target="_blank" rel="nofollow noopener noreferrer">If I use a local remote and a shared cache, will the data be symlinked from the remote to the cache?</a><a href="#if-i-use-a-local-remote-and-a-shared-cache-will-the-data-be-symlinked-from-the-remote-to-the-cache" aria-label="if i use a local remote and a shared cache will the data be symlinked from the remote to the cache permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Very interesting question from @cajoek!</p>
<p>The data will <em>not</em> be symlinked from the remote to the cache.</p>
<p>The cache can be treated as temporary storage: failed experiments and similar
runs can fill it with data that will never be used again. In that case, a local
remote is a good way to keep track of the data that matters for the important
versions of your project.</p>
<p>That way, when your cache grows too big and the project takes up too much
space, you can remove <code>.dvc/cache</code> and download the latest important version from
the remote.</p>
<hr>
<p><img src="https://media.giphy.com/media/f8QPB1rgHbwhcD2Jd6/giphy.gif" alt="iAM_Learning GIF"></p>
<p>At our May Office Hours Meetup, Matt Squire of Fuzzy Labs will join us to
share his view on open source MLOps tools!
<a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/285550813" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a>
to stay up to date with specifics as we get closer to the event!</p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and
CML questions answered!</p>https://dvc.org/blog/terraform-providerhttps://dvc.org/blog/terraform-providerWed, 27 Apr 2022 00:00:00 GMT<p>The requirements for Machine Learning (ML) infrastructure are becoming
increasingly complex. Training large models often requires specialized hardware
(GPUs, TPUs) which involves moving the whole training process onto cloud
machines, setting up environments and synchronizing data. For teams that want to
leverage spot instances, the setup becomes even more complex — they need to
make sure the training progress is not lost during spot instance recovery. This
is time-consuming, and requires expertise in both DevOps and Machine Learning.
Additionally, training in a cloud environment can incur high costs due to the
need for expensive hardware, as well as users forgetting to shut down instances
when training is complete.</p>
<p>To address the specific needs of machine learning teams, we have built
<a href="https://github.com/iterative/terraform-provider-iterative" target="_blank" rel="nofollow noopener noreferrer">Terraform Provider Iterative (TPI)</a>.
TPI is an open-source tool extending the functionality of Terraform, the world's
most widely used multi-cloud provisioning product. The Iterative Provider
enables full lifecycle management of computing resources and is designed
specifically for machine learning pipelines.</p>
<h2 id="tailored-to-machine-learning-workflows" style="position:relative;">Tailored to Machine Learning Workflows<a href="#tailored-to-machine-learning-workflows" aria-label="tailored to machine learning workflows permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The Iterative Provider offers a single resource called <code>iterative_task</code> which
you can use to configure:</p>
<ul>
<li>Your cloud infrastructure</li>
<li>The steps to perform on the cloud resource, i.e. setting up the environment,
running the training pipeline, logging metrics, etc.</li>
<li>The data to be synced back once the training is complete (e.g. a file with
metrics, a model, plots)</li>
</ul>
<p>Here’s a “hello world” example of a <code>main.tf</code> Terraform configuration file using
the <code>iterative_task</code> resource:</p>
<div class="gatsby-highlight" data-language="hcl"><pre class="language-hcl"><code class="language-hcl"><span class="token keyword">terraform</span> <span class="token punctuation">{</span>
<span class="token keyword">required_providers</span> <span class="token punctuation">{</span> <span class="token property">iterative</span> <span class="token punctuation">=</span> <span class="token punctuation">{</span> <span class="token property">source</span> <span class="token punctuation">=</span> <span class="token string">"iterative/iterative"</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span>
<span class="token punctuation">}</span>
<span class="token keyword">provider<span class="token type variable"> "iterative" </span></span><span class="token punctuation">{</span><span class="token punctuation">}</span>
<span class="token keyword">resource <span class="token type variable">"iterative_task"</span></span> <span class="token string">"example"</span> <span class="token punctuation">{</span>
<span class="token property">cloud</span> <span class="token punctuation">=</span> <span class="token string">"aws"</span> <span class="token comment"># or any of: gcp, az, k8s</span>
<span class="token property">machine</span> <span class="token punctuation">=</span> <span class="token string">"m"</span> <span class="token comment"># medium. Or any of: l, xl, m+k80, xl+v100, ...</span>
<span class="token property">image</span> <span class="token punctuation">=</span> <span class="token string">"ubuntu"</span> <span class="token comment"># or "nvidia", ...</span>
<span class="token property">region</span> <span class="token punctuation">=</span> <span class="token string">"us-west"</span> <span class="token comment"># or us-east, eu-west, ...</span>
<span class="token property">disk_size</span> <span class="token punctuation">=</span> <span class="token number">30</span> <span class="token comment"># GB</span>
<span class="token property">spot</span> <span class="token punctuation">=</span> <span class="token number">0</span> <span class="token comment"># auto-price. Default -1 to disable or >0 for hourly USD limit</span>
<span class="token property">timeout</span> <span class="token punctuation">=</span> <span class="token number">24</span>*<span class="token number">60</span>*<span class="token number">60</span> <span class="token comment"># max 24h before forced termination</span>
<span class="token keyword">storage</span> <span class="token punctuation">{</span>
<span class="token property">workdir</span> <span class="token punctuation">=</span> <span class="token string">"."</span>
<span class="token property">output</span> <span class="token punctuation">=</span> <span class="token string">"results"</span>
<span class="token punctuation">}</span>
<span class="token property">script</span> <span class="token punctuation">=</span> <span class="token heredoc string"><<-END
#!/bin/bash
sudo apt update
sudo apt install -y python3-pip
pip3 install --user -r requirements.txt
python3 train.py
END</span>
<span class="token punctuation">}</span></code></pre></div>
<p>Once the training is complete, the Iterative Provider terminates the resource,
so users don't have to worry about spiraling costs from unused machines.</p>
<h2 id="configure-once-bring-everywhere" style="position:relative;">Configure Once, Bring Everywhere<a href="#configure-once-bring-everywhere" aria-label="configure once bring everywhere permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Once you configure infrastructure and a script that executes your training
pipeline in a Terraform configuration file, you can bring that pipeline anywhere
you want. You can use such a config for ad-hoc training at any stage of your
prototyping process or use it as a job in your preferred CI/CD tool. You can
also store your infrastructure configuration files in a version control system
together with the rest of your project for easier control.</p>
<h2 id="one-provider-to-rule-them-all" style="position:relative;">One Provider to Rule Them All<a href="#one-provider-to-rule-them-all" aria-label="one provider to rule them all permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Whether you prefer Amazon Web Services (AWS), Microsoft Azure, Google Cloud
Platform (GCP), or Kubernetes (K8s), the Iterative Provider has you covered. You
can configure compute resources from these with a unified API, using
<a href="https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#machine-type" target="_blank" rel="nofollow noopener noreferrer">common machine types</a>
that are the same across all cloud vendors. This significantly simplifies
infrastructure configuration and makes it easy to migrate from one cloud to
another by changing just one line of code.</p>
<h2 id="costs-optimization" style="position:relative;">Costs Optimization<a href="#costs-optimization" aria-label="costs optimization permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The Iterative Provider helps with cloud compute cost optimization in two major
ways. First, upon completion of your script, the instance is automatically
terminated. This helps to avoid accumulating costs due to abandoned resources.
Second, you can leverage the cost-saving power of spot instances to train your
models without losing any progress! TPI recovers the working directory and
respawns interrupted/preempted instances for you.</p>
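<p>Recovering the working directory only preserves progress if the training script writes its state there and reads it back on startup. The sketch below shows one way a script could do that; the <code>checkpoint.json</code> file name, the <code>train</code> function, and the simulated interruption are assumptions for illustration and are not part of TPI.</p>

```python
import json
import tempfile
from pathlib import Path


def train(workdir, total_epochs, interrupt_at=None):
    """Run a toy training loop that resumes from a checkpoint in workdir.

    Hypothetical example: the checkpoint format and file name are
    assumptions, not something the Iterative Provider prescribes.
    """
    checkpoint = Path(workdir) / "checkpoint.json"
    state = {"epoch": 0, "loss": 1.0}
    if checkpoint.exists():
        # A respawned instance picks up where the previous one stopped.
        state = json.loads(checkpoint.read_text())
    while state["epoch"] < total_epochs:
        if interrupt_at is not None and state["epoch"] == interrupt_at:
            raise RuntimeError("simulated spot interruption")
        state["loss"] *= 0.5  # stand-in for one epoch of real training
        state["epoch"] += 1
        # Persist progress after every epoch so little work is lost.
        checkpoint.write_text(json.dumps(state))
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        try:
            train(workdir, total_epochs=5, interrupt_at=3)
        except RuntimeError as err:
            print("interrupted:", err)
        # The second call resumes from the saved epoch instead of restarting.
        print(train(workdir, total_epochs=5))
```

<p>A production script would typically also write the checkpoint atomically (for example to a temporary file that is then renamed) so that an interruption in the middle of a write cannot corrupt it.</p>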
<h2 id="devops-friendly" style="position:relative;">DevOps-Friendly<a href="#devops-friendly" aria-label="devops friendly permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Last, but not least, the Iterative Provider aims to bridge the gap between
DevOps and Data Science teams. We build on top of Terraform, a tool universally
familiar to DevOps teams, but extend it to suit ML needs.</p>
<p>If you’d like to try the Iterative Provider in your project, check out the
documentation on the provider’s page in the Terraform registry, and if you have
any questions or suggestions, we welcome them in our
<a href="https://github.com/iterative/terraform-provider-iterative" target="_blank" rel="nofollow noopener noreferrer">GitHub repository.</a></p>https://dvc.org/blog/CML-runners-saving-models-1https://dvc.org/blog/CML-runners-saving-models-1Tue, 26 Apr 2022 00:00:00 GMT<p>When you first develop a machine learning model, you will probably do so on your
local machine. You can easily change algorithms, parameters, and input data
right in your text editor, notebook, or terminal. Imagine you have a
long-running model for which you want to detect possible
<a href="https://en.wikipedia.org/wiki/Concept_drift" target="_blank" rel="nofollow noopener noreferrer">drift</a>, however. In that case it
would be beneficial to automatically retrain your model on a regular basis.</p>
<p>In this guide, we will show how you can use
<a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML (Continuous Machine Learning)</a> to do just that. CML is an
open-source library for implementing continuous integration and delivery (CI/CD)
in machine learning projects. This way we can define a pipeline to train a model
and keep track of various versions. Although we could do so directly in our
CI/CD pipeline (e.g. GitHub Actions Workflows), the runners used for this
generally don’t have a lot of processing power. Therefore it makes more sense to
provision a dedicated runner that is tailored to our computing needs.</p>
<p>At the end of this guide we will have set up a CML workflow that does the
following on a daily basis:</p>
<ol>
<li>Provision an Amazon Web Services (AWS) EC2 instance</li>
<li>Train the model</li>
<li>Save the model and its metrics to a GitHub repository</li>
<li>Create a pull request with the new outputs</li>
<li>Terminate the AWS EC2 instance</li>
</ol>
<p>In a follow-up post we will expand upon this by using <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a> to
designate a remote storage for our resulting models. But let's focus on CML
first!</p>
<p>All files needed for this guide can be found in
<a href="https://github.com/iterative/example_model_export_cml" target="_blank" rel="nofollow noopener noreferrer">this repository</a>.</p>
<admon type="info">
<p>This guide can be followed on its own, but also as an extension to this
<a href="https://cml.dev/doc/self-hosted-runners" target="_blank" rel="nofollow noopener noreferrer">example in the docs</a>.</p>
</admon>
<admon type="tip">
<p>We will be using GitHub for our CI/CD and AWS for our computing resources. With
slight modifications, however, you can use
<a href="https://cml.dev/doc/self-hosted-runners?tab=GitLab#allocating-cloud-compute-resources-with-cml" target="_blank" rel="nofollow noopener noreferrer">GitLab CI/CD</a>,
<a href="https://cml.dev/doc/self-hosted-runners?tab=GCP#cloud-compute-resource-credentials" target="_blank" rel="nofollow noopener noreferrer">Google Cloud</a>
or
<a href="https://cml.dev/doc/self-hosted-runners?tab=Azure#cloud-compute-resource-credentials" target="_blank" rel="nofollow noopener noreferrer">Microsoft Azure</a>.</p>
</admon>
<h1 id="prerequisites" style="position:relative;">Prerequisites<a href="#prerequisites" aria-label="prerequisites permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Before we begin, make sure you have the following things set up:</p>
<ol>
<li>You have
<a href="https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/" target="_blank" rel="nofollow noopener noreferrer">created an AWS account</a>
(free tier suffices)</li>
<li>You have
<a href="https://cml.dev/doc/self-hosted-runners?tab=GitHub#personal-access-token" target="_blank" rel="nofollow noopener noreferrer">created a <code>PERSONAL_ACCESS_TOKEN</code> on GitHub</a>
with the <code>repo</code> scope</li>
<li>You have
<a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-creds" target="_blank" rel="nofollow noopener noreferrer">created an <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code> on AWS</a></li>
<li>You have
<a href="https://docs.github.com/en/actions/security-guides/encrypted-secrets" target="_blank" rel="nofollow noopener noreferrer">added the <code>PERSONAL_ACCESS_TOKEN</code>, <code>AWS_ACCESS_KEY_ID</code>, and <code>AWS_SECRET_ACCESS_KEY</code> as GitHub secrets</a></li>
</ol>
<p>It also helps to clone
<a href="https://github.com/iterative/example_model_export_cml" target="_blank" rel="nofollow noopener noreferrer">the template repository for this tutorial</a>.</p>
<h1 id="training-a-model-and-saving-it" style="position:relative;">Training a model and saving it<a href="#training-a-model-and-saving-it" aria-label="training a model and saving it permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>To kick off, we will adapt <code>train.py</code> from the
<a href="https://cml.dev/doc/start/github" target="_blank" rel="nofollow noopener noreferrer">CML getting started guide</a>. Here we create a
simple <code>RandomForestClassifier()</code> based on some generated data. We then use the
model to make some predictions and plot those predictions in a confusion matrix.</p>
<p>While running the script the model is kept in memory, meaning it is discarded as
soon as the script finishes. In order to save the model for later, we need to
dump it as a binary file. We do so with
<a href="https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html" target="_blank" rel="nofollow noopener noreferrer"><code>joblib.dump()</code></a>.
Later we can read the model using
<a href="https://joblib.readthedocs.io/en/latest/generated/joblib.load.html" target="_blank" rel="nofollow noopener noreferrer"><code>joblib.load()</code></a>
when we need to.</p>
<admon type="tip">
<p>You can also use <code>pickle.dump()</code> if you prefer.</p>
</admon>
<p>The outputs of <code>train.py</code> are:</p>
<ul>
<li><code>metrics.txt</code>: a file containing metrics on model performance (in this case
accuracy)</li>
<li><code>confusion_matrix.png</code>: a plot showing the classification results of our model</li>
<li><code>random_forest.joblib</code>: the binary output of the trained model</li>
</ul>
<p>All of these files are saved to the <code>model</code> directory.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> json
<span class="token keyword">import</span> os
<span class="token keyword">import</span> joblib
<span class="token keyword">import</span> matplotlib<span class="token punctuation">.</span>pyplot <span class="token keyword">as</span> plt
<span class="token keyword">import</span> numpy <span class="token keyword">as</span> np
<span class="token keyword">from</span> sklearn<span class="token punctuation">.</span>ensemble <span class="token keyword">import</span> RandomForestClassifier
<span class="token keyword">from</span> sklearn<span class="token punctuation">.</span>metrics <span class="token keyword">import</span> ConfusionMatrixDisplay
<span class="token comment"># Read in data</span>
X_train <span class="token operator">=</span> np<span class="token punctuation">.</span>genfromtxt<span class="token punctuation">(</span><span class="token string">"data/train_features.csv"</span><span class="token punctuation">)</span>
y_train <span class="token operator">=</span> np<span class="token punctuation">.</span>genfromtxt<span class="token punctuation">(</span><span class="token string">"data/train_labels.csv"</span><span class="token punctuation">)</span>
X_test <span class="token operator">=</span> np<span class="token punctuation">.</span>genfromtxt<span class="token punctuation">(</span><span class="token string">"data/test_features.csv"</span><span class="token punctuation">)</span>
y_test <span class="token operator">=</span> np<span class="token punctuation">.</span>genfromtxt<span class="token punctuation">(</span><span class="token string">"data/test_labels.csv"</span><span class="token punctuation">)</span>
<span class="token comment"># Fit a model</span>
depth <span class="token operator">=</span> <span class="token number">5</span>
clf <span class="token operator">=</span> RandomForestClassifier<span class="token punctuation">(</span>max_depth<span class="token operator">=</span>depth<span class="token punctuation">)</span>
clf<span class="token punctuation">.</span>fit<span class="token punctuation">(</span>X_train<span class="token punctuation">,</span> y_train<span class="token punctuation">)</span>
<span class="token comment"># Calculate accuracy</span>
acc <span class="token operator">=</span> clf<span class="token punctuation">.</span>score<span class="token punctuation">(</span>X_test<span class="token punctuation">,</span> y_test<span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>acc<span class="token punctuation">)</span>
<span class="token comment"># Create model folder if it does not yet exist</span>
<span class="token keyword">if</span> <span class="token keyword">not</span> os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>exists<span class="token punctuation">(</span><span class="token string">"model"</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    os<span class="token punctuation">.</span>makedirs<span class="token punctuation">(</span><span class="token string">"model"</span><span class="token punctuation">)</span>
<span class="token comment"># Write metrics to file</span>
<span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"model/metrics.txt"</span><span class="token punctuation">,</span> <span class="token string">"w+"</span><span class="token punctuation">)</span> <span class="token keyword">as</span> outfile<span class="token punctuation">:</span>
    outfile<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token string">"Accuracy: "</span> <span class="token operator">+</span> <span class="token builtin">str</span><span class="token punctuation">(</span>acc<span class="token punctuation">)</span> <span class="token operator">+</span> <span class="token string">"\n"</span><span class="token punctuation">)</span>
<span class="token comment"># Plot confusion matrix (ConfusionMatrixDisplay requires scikit-learn >= 1.0)</span>
disp <span class="token operator">=</span> ConfusionMatrixDisplay<span class="token punctuation">.</span>from_estimator<span class="token punctuation">(</span>clf<span class="token punctuation">,</span> X_test<span class="token punctuation">,</span> y_test<span class="token punctuation">,</span> normalize<span class="token operator">=</span><span class="token string">"true"</span><span class="token punctuation">,</span> cmap<span class="token operator">=</span>plt<span class="token punctuation">.</span>cm<span class="token punctuation">.</span>Blues<span class="token punctuation">)</span>
plt<span class="token punctuation">.</span>savefig<span class="token punctuation">(</span><span class="token string">"model/confusion_matrix.png"</span><span class="token punctuation">)</span>
<span class="token comment"># Save the model</span>
joblib<span class="token punctuation">.</span>dump<span class="token punctuation">(</span>clf<span class="token punctuation">,</span> <span class="token string">"model/random_forest.joblib"</span><span class="token punctuation">)</span></code></pre></div>
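<p>To make the save-and-load round trip concrete, here is a minimal sketch that uses the standard library's <code>pickle</code> module (mentioned in the tip above) so that it runs without scikit-learn installed. The dictionary standing in for a fitted model, the <code>predict</code> helper, and the <code>model.pkl</code> file name are assumptions for illustration; with the script above you would call <code>joblib.load("model/random_forest.joblib")</code> and use the returned classifier's own <code>predict()</code> method instead.</p>

```python
import pickle
import tempfile
from pathlib import Path


def predict(model, values):
    """Toy prediction rule: 1 if the value exceeds the stored threshold."""
    return [int(v > model["threshold"]) for v in values]


def save_model(model, path):
    # Serialize the in-memory model to a binary file.
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    # Only unpickle files you trust: loading can execute arbitrary code.
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # Hypothetical stand-in for a fitted classifier's learned state.
    model = {"threshold": 0.5}
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "model.pkl"
        save_model(model, path)
        restored = load_model(path)
        print(predict(restored, [0.2, 0.9]))
```

<p>The same pattern applies to <code>joblib</code>: the file written at the end of one run is the only thing a later process needs in order to make predictions.</p>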
<h1 id="train-the-model-on-a-daily-basis" style="position:relative;">Train the model on a daily basis<a href="#train-the-model-on-a-daily-basis" aria-label="train the model on a daily basis permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Now that we have a script to train our model and save it as a file, let’s set up
our CI/CD to provision a runner and run the script. We define our workflow in
<code>cml.yaml</code> and save it in the <code>.github/workflows</code> directory. This way GitHub
will automatically run the workflow whenever it is triggered. In this case the
triggers are on (manual) request as well as daily (automatic) schedule.</p>
<admon type="info">
<p>The name of the workflow doesn’t matter, as long as it’s a <code>.yaml</code> and located
in the <code>.github/workflows</code> directory. You can have multiple workflows in there
as well. You can learn more in the
<a href="https://docs.github.com/en/actions/learn-github-actions/workflow-syntax-for-github-actions" target="_blank" rel="nofollow noopener noreferrer">documentation</a>
here.</p>
</admon>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">name</span><span class="token punctuation">:</span> CML
<span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token comment"># Here we use two triggers: manually and daily at 08:00</span>
  <span class="token key atrule">workflow_dispatch</span><span class="token punctuation">:</span>
  <span class="token key atrule">schedule</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token key atrule">cron</span><span class="token punctuation">:</span> <span class="token string">'0 8 * * *'</span>
<span class="token key atrule">jobs</span><span class="token punctuation">:</span>
  <span class="token key atrule">deploy-runner</span><span class="token punctuation">:</span>
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> ubuntu<span class="token punctuation">-</span>latest
    <span class="token key atrule">steps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Deploy runner on EC2
        <span class="token key atrule">env</span><span class="token punctuation">:</span>
          <span class="token key atrule">REPO_TOKEN</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span>
          <span class="token key atrule">AWS_ACCESS_KEY_ID</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_ACCESS_KEY_ID <span class="token punctuation">}</span><span class="token punctuation">}</span>
          <span class="token key atrule">AWS_SECRET_ACCESS_KEY</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_SECRET_ACCESS_KEY <span class="token punctuation">}</span><span class="token punctuation">}</span>
        <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
          cml runner \
              --cloud=aws \
              --cloud-region=eu-west \
              --cloud-type=t2.micro \
              --labels=cml-runner \
              --single</span>
  <span class="token key atrule">train-model</span><span class="token punctuation">:</span>
    <span class="token key atrule">needs</span><span class="token punctuation">:</span> deploy<span class="token punctuation">-</span>runner
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">-</span>runner<span class="token punctuation">]</span>
    <span class="token key atrule">timeout-minutes</span><span class="token punctuation">:</span> <span class="token number">120</span> <span class="token comment"># 2h</span>
    <span class="token key atrule">steps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>python@v2
        <span class="token key atrule">with</span><span class="token punctuation">:</span>
          <span class="token key atrule">python-version</span><span class="token punctuation">:</span> <span class="token string">'3.x'</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>node@v3
        <span class="token key atrule">with</span><span class="token punctuation">:</span>
          <span class="token key atrule">node-version</span><span class="token punctuation">:</span> <span class="token string">'16'</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Train model
        <span class="token key atrule">env</span><span class="token punctuation">:</span>
          <span class="token key atrule">REPO_TOKEN</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span>
        <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
cml ci
pip install -r requirements.txt
python get_data.py
python train.py</span></code></pre></div>
<admon type="warn">
<p>In this example we are using a <code>t2.micro</code>
<a href="https://aws.amazon.com/ec2/instance-types/" target="_blank" rel="nofollow noopener noreferrer">AWS EC2 instance</a>. At the time of
writing, this is included in the AWS free tier. Make sure that you qualify for
this free usage to prevent unexpected spending. When you specify a bulkier
<code>cloud-type</code>, your expenses will rise.</p>
</admon>
<p>The workflow we defined first
<a href="https://cml.dev/doc/ref/runner" target="_blank" rel="nofollow noopener noreferrer">provisions a runner</a> on AWS, and then uses that
runner to train the model. After completing the training job, CML automatically
terminates the runner to prevent you from incurring further costs. Once the
runner is terminated, however, the model is lost along with it. Let's see how we
can save our model in the next step!</p>
<h1 id="export-the-model-to-our-git-repository" style="position:relative;">Export the model to our Git repository<a href="#export-the-model-to-our-git-repository" aria-label="export the model to our git repository permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>CML allows us to export the model from our runner to our Git repository. Let's
extend the training stage of our workflow by pushing <code>random_forest.joblib</code> to a
new experiment branch and creating a pull request.</p>
<p><a href="https://cml.dev/doc/ref/pr" target="_blank" rel="nofollow noopener noreferrer"><code>cml pr</code></a> is the command that specifies which files
should be included in the pull request. The commands after that are used to
generate a report in the pull request that displays the confusion matrix and
calculated metrics.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train-model</span><span class="token punctuation">:</span>
  <span class="token key atrule">needs</span><span class="token punctuation">:</span> deploy<span class="token punctuation">-</span>runner
  <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">-</span>runner<span class="token punctuation">]</span>
  <span class="token key atrule">timeout-minutes</span><span class="token punctuation">:</span> <span class="token number">120</span> <span class="token comment"># 2h</span>
  <span class="token key atrule">steps</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
    <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>python@v2
      <span class="token key atrule">with</span><span class="token punctuation">:</span>
        <span class="token key atrule">python-version</span><span class="token punctuation">:</span> <span class="token string">'3.x'</span>
    <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>node@v3
      <span class="token key atrule">with</span><span class="token punctuation">:</span>
        <span class="token key atrule">node-version</span><span class="token punctuation">:</span> <span class="token string">'16'</span>
    <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1
    <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Train model
      <span class="token key atrule">env</span><span class="token punctuation">:</span>
        <span class="token key atrule">REPO_TOKEN</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span>
      <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
        cml ci
        pip install -r requirements.txt
        python get_data.py
        python train.py</span>
        <span class="token comment"># Create pull request</span>
        cml pr model/random_forest.joblib
        <span class="token comment"># Create CML report</span>
        cat model/metrics.txt <span class="token punctuation">></span> report.md
        cml publish model/confusion_matrix.png <span class="token punctuation">-</span><span class="token punctuation">-</span>md <span class="token punctuation">></span><span class="token punctuation">></span> report.md
        cml send<span class="token punctuation">-</span>comment <span class="token punctuation">-</span><span class="token punctuation">-</span>pr <span class="token punctuation">-</span><span class="token punctuation">-</span>update report.md</code></pre></div>
<p>Et voilà! We are now running a daily model training on an AWS EC2 instance and
saving the resulting model to our GitHub repository.</p>
<p>There is still some room for improvement, though. This approach works well when
our resulting model is small (less than 100MB), but we wouldn't want to store
large models in our Git repository. In a follow-up post we will describe how we
can use <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, another Iterative open-source tool, for storage
when we're dealing with larger files.</p>
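<p>As a rough illustration of the size constraint above, a guard step could run
before <code>cml pr</code> and stop the job when the model is too large for Git. This
is a minimal sketch, not part of the original workflow: the step name and the
100 MiB threshold are assumptions, and <code>stat -c %s</code> assumes GNU coreutils,
as found on Ubuntu-based runners.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"># Hypothetical extra step for the train-model job, placed before `cml pr`
- name: Check model size
  run: |
    # Assumption: GNU stat prints the file size in bytes
    size=$(stat -c %s model/random_forest.joblib)
    # Assumed threshold of 100 MiB; adjust to your own policy
    limit=$((100 * 1024 * 1024))
    if [ "$size" -ge "$limit" ]; then
      echo "Model is $size bytes; too large to store in Git"
      exit 1
    fi</code></pre></div>
<p>Failing early keeps an oversized file from ever reaching the pull request; the
threshold can be lowered if you prefer to keep the repository lean.</p>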
<h1 id="conclusions" style="position:relative;">Conclusions<a href="#conclusions" aria-label="conclusions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>There are many cases in which it's a good idea to retrain models periodically.
For example, retraining on the latest available data helps prevent model drift.
CML allows you to automate this process.</p>
<p>In this guide, we explored how to set up CML for a daily training job using a
self-hosted runner. We automatically provisioned this runner on AWS, exported
the resulting files to our Git repository, and terminated the runner to prevent
racking up our AWS bill.</p>
<p>In a follow-up post we will explore how to use DVC when the resulting model is
too large to store directly in our Git repository.</p>
<p>Another great extension of our CI/CD would be a <code>deploy</code> step to bring the
latest version of our model into production. This step might be conditional on
the performance of the model; we could decide to only start using it in
production if it performs better than previous iterations. All of this warrants
a guide of its own, however, so look out for that in the future! 😉</p>https://dvc.org/blog/april-22-heartbeathttps://dvc.org/blog/april-22-heartbeatFri, 15 Apr 2022 00:00:00 GMT<details>
<p>This month's Heartbeat image is inspired by Community member Gudmundur
Heimisson. Gudmundur submitted some great PRs to update WebHDFS docs pending
some other issues in the DVC repo.</p>
<p>This image reflects his Paris-area team's view of Château de Vincennes from their
company windows!</p>
<p>We are grateful for all our Community members' contributions from all around the
world!</p>
<summary>✨Image Inspo✨</summary>
</details>
<p>Welcome to April! We have lots to ingest from the AI World and the Community so
let's get started with all the building blocks for success!</p>
<h2 id="ai-news" style="position:relative;">AI News<a href="#ai-news" aria-label="ai news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><img src="https://media.giphy.com/media/l0JMrPWRQkTeg3jjO/giphy.gif" alt="Lego Rotate GIF by sheepfilms"></p>
<h3 id="the-future-of-ai-infrastructure-is-becoming-modular-why-best-of-breed-mlops-solutions-are-taking-off--top-players-to-watch" style="position:relative;">The Future of AI Infrastructure is Becoming Modular: Why Best-of-Breed MLOps Solutions are Taking Off & Top Players to Watch<a href="#the-future-of-ai-infrastructure-is-becoming-modular-why-best-of-breed-mlops-solutions-are-taking-off--top-players-to-watch" aria-label="the future of ai infrastructure is becoming modular why best of breed mlops solutions are taking off top players to watch permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://twitter.com/CasberW" target="_blank" rel="nofollow noopener noreferrer"><strong>Casber Wang</strong></a> of
<a href="https://twitter.com/SapphireVC" target="_blank" rel="nofollow noopener noreferrer">Sapphire VC</a> recently wrote
<a href="https://medium.com/sapphire-ventures-perspectives/the-future-of-ai-infrastructure-is-becoming-modular-why-best-of-breed-mlops-solutions-are-taking-fd85c6ca8bcf" target="_blank" rel="nofollow noopener noreferrer">a piece on Medium</a>
on the necessary trend of AI infrastructure tooling becoming modular. He
identifies three types of AI users: "Off-the-shelfers," "Bet-the-Farmers," and
"Rocket Scientists." As the industry matures, he makes the case (and we concur)
for modular infrastructure tooling that gives AI teams the most flexible
approach as they fine-tune their advancing and ever-growing processes.</p>
<blockquote>
<p>Where organizations used to seek all-in-one solutions to operationalize
machine learning (ML) due to limited in-house resources and expertise, we’re
seeing a rise in the demand for modular, best-in-class tooling that equips
today’s more robust ML teams with the ability to flexibly run highly-custom
and performant ML workloads.</p>
</blockquote>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9e3d9a1bac94ef897a27f78d1c41d8a7/39600/clayton-christensen.png" alt="Clayton Christensen's Modularity Theory" title="Clayton Christensen's Modularity Theory" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Clayton Christensen's Modularity Theory
(<a href="https://medium.com/sapphire-ventures-perspectives/the-future-of-ai-infrastructure-is-becoming-modular-why-best-of-breed-mlops-solutions-are-taking-fd85c6ca8bcf" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<blockquote>
<p>Soon, large data teams will turn to modular toolkits with dozens of solutions
that manage different stages of the AI lifecycle. This will be particularly
true of the “bet-the-farmers”, who will need customized, best-in-class tools
that provide the flexibility that can match their exact challenge.</p>
</blockquote>
<p>Wang describes the different toolchain groupings in the AI Lifecycle and
discusses some of the players in each of them. DVC shows up in the Model
Evaluation & Experiment Tracking group, but soon you will see that our tools
deliver flexible, modular building blocks for some other pieces of the puzzle.</p>
<h2 id="data-distribution-shifts-and-monitoring" style="position:relative;">Data Distribution Shifts and Monitoring<a href="#data-distribution-shifts-and-monitoring" aria-label="data distribution shifts and monitoring permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://twitter.com/chipro" target="_blank" rel="nofollow noopener noreferrer"><strong>Chip Huyen's</strong></a>
<a href="https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html" target="_blank" rel="nofollow noopener noreferrer">most recent blog post</a>
created for the course at Stanford
<a href="https://cs329s.stanford.edu/" target="_blank" rel="nofollow noopener noreferrer">CS 329S: Machine Learning Systems Design</a> goes
into detail on all things related to data distribution shifts and the monitoring
of them. The piece provides great examples to understand concepts such as
natural labels, the types of distribution shifts, causes of ML System failure,
and the metrics needed to monitor these things to determine when your model is
no longer producing the desired results. She discusses tools that can help
identify these shifts including logs, dashboards, and alerts, acknowledging the
pluses and minuses of each approach. Finally, she discusses why the term
<em>observability</em> is increasingly favored over <em>monitoring</em>: it is a stronger
concept, concerned with determining what went wrong in the internal states of a
system by observing its external outputs.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b0ee0c27cf955adeab2a7e0b2e35c49c/39600/chip-huyen.png" alt="Drift Detection Algorithms" title="Drift Detection Algorithms" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Drift detection algorithms by open-source package alibi-detect
(<a href="https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html#monitoring-toolbox" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<p>Related to this, for a tutorial on how to detect drift and correct your model
with <a href="https://evidentlyai.com/" target="_blank" rel="nofollow noopener noreferrer">Evidently AI</a> and DVC, see
<a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor's</strong></a> latest post on
<a href="https://dvc.org/blog/stale-models" target="_blank" rel="nofollow noopener noreferrer">Preventing Stale Models in Production!</a></p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7abf9ed6a309e5b6310932fc37cd4777/39600/stale-model-cover.png" alt="Preventing Stale Models in Production" title="Preventing Stale Models in Production" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Preventing Stale Models in Production
(<a href="https://dvc.org/blog/stale-models" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h3 id="mlops-is-the-solution-for-machine-learning-and-ai-projects" style="position:relative;">MLOps is the Solution for Machine Learning and AI Projects<a href="#mlops-is-the-solution-for-machine-learning-and-ai-projects" aria-label="mlops is the solution for machine learning and ai projects permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The team at <a href="https://xpresso.ai" target="_blank" rel="nofollow noopener noreferrer"><strong>xpresso.ai</strong></a> created
<a href="https://xpresso.ai/resources/blogs/mlops-is-the-solution-for-machine-learning-and-ai-projects/?utm_source=rss&utm_medium=rss&utm_campaign=mlops-is-the-solution-for-machine-learning-and-ai-projects" target="_blank" rel="nofollow noopener noreferrer">this short post</a>
about all the facets that make up MLOps. While the tried and true
<a href="https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining" target="_blank" rel="nofollow noopener noreferrer">CRISP-DM</a>
model for Data Science takes us right up to production, MLOps encompasses
considerably more processes that keep and maintain a model in production over
time. You can see all of these things highlighted in their image below,
providing lots to ponder!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/75f5d2da64a689d6f93dbe57c92a3e97/03346/Machine-Learning-Operations.jpg" alt="Machine Learning Operations" title="Machine Learning Operations" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Machine Learning Operations
(<a href="https://xpresso.ai/resources/blogs/mlops-is-the-solution-for-machine-learning-and-ai-projects/?utm_source=rss&utm_medium=rss&utm_campaign=mlops-is-the-solution-for-machine-learning-and-ai-projects" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="community-news" style="position:relative;">Community News<a href="#community-news" aria-label="community news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="kaushik-shakkari-the-three-environments-for-ai-professionals--research-development-and-production" style="position:relative;">Kaushik Shakkari: The three environments for AI Professionals — Research, Development, and Production<a href="#kaushik-shakkari-the-three-environments-for-ai-professionals--research-development-and-production" aria-label="kaushik shakkari the three environments for ai professionals research development and production permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b59a0f18177f3039887a5e1efa21fe23/39600/kaushik-shakkari.png" alt="The three environments for AI Professionals - Research, Development, and Production" title="The three environments for AI Professionals - Research, Development, and Production" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
If your head is spinning from all the facets of the MLOps world outlined in
xpresso.ai's diagram above, and you are wondering where you fit in it, or in the AI world in
general, <a href="https://www.linkedin.com/in/kaushik-shakkari/" target="_blank" rel="nofollow noopener noreferrer"><strong>Kaushik Shakkari</strong></a>
wrote
<a href="https://kaushikshakkari.medium.com/the-three-environments-for-ai-professionals-research-development-and-production-cffb86dfe533" target="_blank" rel="nofollow noopener noreferrer">this article</a>
dividing up the AI space into three environments: Research, Development, and
Production. He goes into detail about the type of work, skillsets, and roles
found in each. This breakdown can help readers zero in on where they may best
fit and be fulfilled in this vast and often confusing space, as well as
determine a pathway for their careers.</p>
<h3 id="yashaswi-nayak-continuous-machine-learning---an-introduction-to-cml-iterativeai" style="position:relative;">Yashaswi Nayak: Continuous Machine Learning - An Introduction to CML (Iterative.ai)<a href="#yashaswi-nayak-continuous-machine-learning---an-introduction-to-cml-iterativeai" aria-label="yashaswi nayak continuous machine learning an introduction to cml iterativeai permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><span class="gatsby-resp-image-wrapper image-wrap-right" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ed480cafde4f667400e7defddc2f3400/39600/yashaswi-nayak.png" alt="Continuous Machine Learning - An Introduction to CML" title="Continuous Machine Learning - An Introduction to CML" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<a href="https://twitter.com/YashaswiNayak" target="_blank" rel="nofollow noopener noreferrer"><strong>Yashaswi Nayak</strong></a> writes
<a href="https://towardsdatascience.com/continuous-machine-learning-e1ffb847b8da" target="_blank" rel="nofollow noopener noreferrer">a wonderful guide</a>
for data scientists and engineers, filled with great story-telling and fun
images created by the author about using <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a> to provide CI/CD
to ML projects. He discusses the usual software development cycle using Git and
then follows with the complexities introduced by ML projects. He identifies the
reasons why CML is needed in the ML space, and how CML works.</p>
<p>Yashaswi gives the scenario of a team working on a classifier problem and how CML
would work for different team members tackling different parts of the problem.
He details all the questions a CML.yml file answers and takes care of in the
workflow. Finally, he lists a number of use cases for readers to try out with
CML. We'd love to see some Community members write about some of these use cases
that they've put into action!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/a71f83c3de3d73e01b53560003789e21/03346/cml-workflow.jpg" alt="Continuous Machine Learning" title="Continuous Machine Learning" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>CML workflow
(<a href="https://towardsdatascience.com/continuous-machine-learning-e1ffb847b8da" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h3 id="zoumana-keita-mlops--data-and-model-versioning-with-dvc-and-azure-blob-storage" style="position:relative;">Zoumana Keita: MLops — Data And Model Versioning With DVC and Azure Blob Storage<a href="#zoumana-keita-mlops--data-and-model-versioning-with-dvc-and-azure-blob-storage" aria-label="zoumana keita mlops data and model versioning with dvc and azure blob storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If you've ever struggled with setting up your Azure Blob Storage with DVC, or
you know you will need to in the near future, you're in luck!
<a href="https://twitter.com/zoumana_keita_" target="_blank" rel="nofollow noopener noreferrer"><strong>Zoumana Keita</strong></a> shows you how to do just
that
<a href="https://towardsdatascience.com/large-data-versioning-with-dvc-and-azure-blob-storage-a-complete-guide-b97344827c81" target="_blank" rel="nofollow noopener noreferrer">in this post</a>
in <a href="https://towardsdatascience.com" target="_blank" rel="nofollow noopener noreferrer">Towards Data Science.</a> He was recently
struggling with the same problem, and team member
<a href="https://twitter.com/daviddelachurch" target="_blank" rel="nofollow noopener noreferrer">David de la Iglesia Castro</a> came to the
rescue on our <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord Server.</a> Zoumana was
kind enough to write a blog article on the detailed steps for the benefit of the
Community.</p>
<p>At this point in this Heartbeat, you probably grasp the importance of data,
model, and experiment versioning and how DVC easily versions large files in
conjunction with Git, which Zoumana describes. But he then takes you on a
detailed journey with screenshots of all the steps to get DVC set up with Azure
Blob Storage. Many thanks for this tutorial! 🙏🏼</p>
<p>
</p><section class="elp-content-holder">
<a href="https://towardsdatascience.com/large-data-versioning-with-dvc-and-azure-blob-storage-a-complete-guide-b97344827c81" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">MLOps — Data And Model Versioning With DVC And Azure Blob Storage</h4>
<div class="elp-description">Zoumana Keita's detailed tutorial on how to set up Azure Blob Storage with DVC</div>
<div class="elp-link">https://towardsdatascience.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2022-04-15/zoumana-keita-19150abaef96d64b94afb3d616881d45.png" alt="MLOps — Data And Model Versioning With DVC And Azure Blob Storage">
</div>
</a>
</section>
<p></p>
<h3 id="ahmed-abdullah-perfect-way-of-versioning-models--training-data" style="position:relative;">Ahmed Abdullah: Perfect Way of Versioning Models & Training Data<a href="#ahmed-abdullah-perfect-way-of-versioning-models--training-data" aria-label="ahmed abdullah perfect way of versioning models training data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://www.linkedin.com/in/ahmed-abdullah-7b1806180/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ahmed Abdullah</strong></a>
<a href="https://medium.com/red-buffer/perfect-way-of-versioning-models-training-data-318819a1510d" target="_blank" rel="nofollow noopener noreferrer">wrote this tutorial</a>
on Medium about how to set up DVC to version your data and models with
Google Drive. He takes you through the steps in detail and discusses many of the
reasons why this versioning is important to your success as an ML engineer,
including ever-changing data, effective collaboration with teammates, and the
need for keeping data separated from code for security reasons.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/red-buffer/perfect-way-of-versioning-models-training-data-318819a1510d" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Perfect Way of Versioning Models & Training Data</h4>
<div class="elp-description">Ahmed Abdullah's detailed tutorial on using DVC for versioning data and models with Google Drive</div>
<div class="elp-link">https://medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2022-04-15/ahmed-abdullah-191bb07c64bc8df20e07f777f49e602a.png" alt="Perfect Way of Versioning Models & Training Data">
</div>
</a>
</section>
<h2 id="conference-news" style="position:relative;">Conference News<a href="#conference-news" aria-label="conference news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In-person conferences are going on and we are excited to be able to see the
Community in person again!</p>
<ul>
<li><a href="https://twitter.com/GiftOjeabulu_" target="_blank" rel="nofollow noopener noreferrer"><strong>Gift Ojeabulu</strong></a> presented at
<a href="https://festival.oscafrica.org/" target="_blank" rel="nofollow noopener noreferrer">Open Source Festival 2022</a> in Lagos, Nigeria
with the talk: <em>MLOps Exploration with Git & DVC for Machine Learning Project
on DAGsHub</em>
[<a href="https://speakerdeck.com/giftojabu1/mlops-exploration-with-git-and-dvc-for-machine-learning-project-on-dagshub?slide=2" target="_blank" rel="nofollow noopener noreferrer">slides</a>]</li>
<li><a href="https://twitter.com/AntoineToubhans" target="_blank" rel="nofollow noopener noreferrer"><strong>Antoine Toubhans</strong></a> presented
<em>Flexible ML Experiment Tracking System for Python Coders with DVC and
Streamlit</em> at PyCon Berlin
[<a href="https://github.com/sicara/pycon-2022-dvc-streamlit" target="_blank" rel="nofollow noopener noreferrer">repo, slides</a>]</li>
<li><a href="https://twitter.com/daviddelachurch" target="_blank" rel="nofollow noopener noreferrer"><strong>David de la Iglesia Castro</strong></a>
presented <em>Making MLOps Uncool Again</em> at PyCon Berlin
[<a href="https://github.com/iterative/workshop-uncool-mlops" target="_blank" rel="nofollow noopener noreferrer">repo</a>]</li>
<li>Next week at <a href="https://odsc.com/boston/" target="_blank" rel="nofollow noopener noreferrer">ODSC East</a>, come see
<a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a> presenting <em>Model
Registry with OpenSource Tools: Git, GitHub, and CI/CD</em>;
<a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a> with <em>Preventing
Stale Models in Production</em>; and
<a href="https://twitter.com/alex000kim" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim</strong></a> with <em>Reproducibility, ML Pipelines,
and CI/CD in Computer Vision Projects</em>
[<a href="https://odsc.com/boston/schedule/" target="_blank" rel="nofollow noopener noreferrer">more info</a>]</li>
<li>Visit us at <a href="https://mlopsworld.com/" target="_blank" rel="nofollow noopener noreferrer">MLOps World</a> June 9-10!</li>
</ul>
<h2 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="online-course-updates" style="position:relative;">Online Course Updates<a href="#online-course-updates" aria-label="online course updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><img src="https://media.giphy.com/media/EdRgVzb2X3iJW/giphy.gif" alt="Surprised Owl GIF"></p>
<p>We've grown from 250 students last month to 450 right now!🎉 We are so happy to
see you all learning on the <a href="https://learn.dvc.org" target="_blank" rel="nofollow noopener noreferrer">platform</a>! What's coming:</p>
<ul>
<li>We have heard from some of you that you would like captions. We are working on
it!</li>
<li>Course guide - each video will soon come with a course guide that has the
corresponding resources, explanations, and diagrams for that lesson, and you
will be able to take your own notes.</li>
</ul>
<p>Thank you to all who have provided feedback after each course module! We are
going through this feedback, making adjustments, and keeping them in mind for
the next course!</p>
<h3 id="5-new-hires" style="position:relative;">5 New Hires!🎉<a href="#5-new-hires" aria-label="5 new hires permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://www.linkedin.com/in/dan-martinec-30739a54/" target="_blank" rel="nofollow noopener noreferrer"><strong>Dan Martinec</strong></a> joins us
from the Czech Republic as a field data scientist. Dan first learned about
Iterative through using DVC in his work as an ML Engineer. Dan originally
studied Control Engineering at CTU in Prague. He graduated with a PhD and has
worked in various fields (C++ development at Porsche, mathematical optimization
in a small start-up, ML engineer at Avast). When not working, Dan enjoys hobby
projects in the garden such as building his own storage lodge for firewood,
building a wooden composter, implementing a wireless water level reader in the
water tank, etc. And after that hard work, he is known to appreciate a good
movie. Welcome, Dan!</p>
<p><a href="https://www.linkedin.com/in/yury-kasimov-103962b8/" target="_blank" rel="nofollow noopener noreferrer"><strong>Yury Kasimov</strong></a> also
joins us from Prague, the Czech Republic as a Field Data Scientist. He studied
Robotics during his Bachelor's studies and then Artificial Intelligence for his
Master's degree. Yury worked for some time as part of a university group that helps
protect NGOs from different cyber attacks. Prior to joining the team, he spent
the last 4 years as an ML engineer at Avast. In his free time, Yury plays a lot
of tennis and is learning to play the drums. He speaks English, Czech, Russian,
and a bit of Spanish. Bienvenidos, Yury!</p>
<p><a href="https://www.linkedin.com/in/chazblack1/" target="_blank" rel="nofollow noopener noreferrer"><strong>Chaz Black</strong></a> joins us as an Account
Executive from Atlanta, Georgia. Most recently he worked at H2O.ai leading their
business development team for 3 years. When Chaz is not helping clients, you may
find him checking out the ever-growing Atlanta food scene and hunting new and
exciting coffees and brewing styles. He is also a big audiophile and like many
on our team, Chaz enjoys board and video games when he has the time, with his
two cats looking over his shoulder. Welcome, Chaz!</p>
<p>Many in our Community already know our latest hire,
<a href="https://github.com/dacbd" target="_blank" rel="nofollow noopener noreferrer"><strong>Daniel Barnes</strong></a>, as he has already been a great
contributor to our tools! We are excited to welcome him officially to the team
as a Software Engineer. Daniel is based in the Seattle, Washington area, having
recently moved back after two years in Korea. He has had a varied career path,
starting in IT security, programming, as a medic, then cyber in the US military,
and then to PACCAR where he discovered our open-source community! When not
solving complex software engineering challenges, Daniel has been noted as a bit
of an adrenaline junky with "hobbies" like skydiving, paragliding, and
motorcycles. Welcome, Daniel!</p>
<p><a href="https://www.linkedin.com/in/maximaginsky/" target="_blank" rel="nofollow noopener noreferrer"><strong>Maxim Aginsky</strong></a> joins the team as
a Senior Product Designer from Montreal, Canada, marking our 4th employee from
the Province of Quebec! Maxim has worn many hats over the years working on
Product Development and most recently was the Director of Design for a Montreal
Fintech company. You can <a href="https://arrowww.space/" target="_blank" rel="nofollow noopener noreferrer">explore his portfolio here.</a>
Welcome, Maxim!</p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Even with our amazing new additions to the team, we're still hiring!
<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a>
to find details of all the positions and share with anyone you think may be
interested! 🚀</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b575ad2dddeaef4c8a1e475f80cc5ca2/03346/hiring.jpg" alt="Iterative.ai is Hiring" title="Iterative.ai is Hiring" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Iterative is Hiring
(<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We've been following along on <a href="https://twitter.com/__anavc__" target="_blank" rel="nofollow noopener noreferrer"><strong>Anna's</strong></a>
journey through #100daysofcode to learn DVC. And now she's working on a project
of her own using Amazon Best Seller data.</p>
<hr>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/stale-modelshttps://dvc.org/blog/stale-modelsThu, 31 Mar 2022 00:00:00 GMT<admon type="info">
<p>This post hasn't been updated since its release and the repo is currently
broken. Our team is in the process of updating it. Nonetheless, the concepts
described still hold true and you should be able to follow along with minor
changes.</p>
</admon>
<p>What happens when the model you've worked so hard to get to production becomes
stale? Machine learning engineers and data scientists face this problem all the
time. You usually have to figure out where the data drift started so you can
determine what input data has changed. Then you need to retrain the model with
this new dataset.</p>
<p>Retraining could involve a number of experiments across multiple datasets, and
it would be helpful to be able to keep track of all of them. In this tutorial,
we'll walk through how using DVC can help you keep track of those experiments
and how this will speed up the time it takes to get new models out to
production, preventing stale ones from lingering too long.</p>
<h2 id="setting-up-the-project" style="position:relative;">Setting up the project<a href="#setting-up-the-project" aria-label="setting up the project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We'll be working with a project from
<a href="https://evidentlyai.com/blog/tutorial-1-model-analytics-in-production" target="_blank" rel="nofollow noopener noreferrer">Evidently.ai</a>
that demonstrates what it would be like to work with a production model that
experiences data drift over time. We'll take this to the next level by adding
some automation with a DVC pipeline and share the results with others using DVC
Studio.</p>
<p>So we'll start by cloning
<a href="https://github.com/iterative/stale-model-example" target="_blank" rel="nofollow noopener noreferrer">this repo for the project</a>.
This project is based on the one created by
<a href="https://github.com/evidentlyai/evidently/blob/main/examples/data_stories/bicycle_demand_monitoring.ipynb" target="_blank" rel="nofollow noopener noreferrer">evidently.ai</a>
with some modifications to work with DVC and different datasets.</p>
<p>The reason we're adding DVC and Studio to this project is to automate the way
our model evaluation pipeline runs and to version our data as we get new data.
We'll be able to share and review the results for each experiment run we do. One
of the big problems in machine learning is collaboration, so making it easier to
share models, data, and results can save your team a lot of time and
frustration.</p>
<h2 id="set-up-data-drift-reports" style="position:relative;">Set up data drift reports<a href="#set-up-data-drift-reports" aria-label="set up data drift reports permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>When the data in production starts to look different from the data that your
model was trained on, this is called data drift. There are a number of tools that
help monitor for data drift like <a href="https://docs.evidentlyai.com/" target="_blank" rel="nofollow noopener noreferrer">evidently.ai</a>
or <a href="https://docs.aporia.com/" target="_blank" rel="nofollow noopener noreferrer">Aporia</a>.</p>
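<p>To make the idea of drift concrete, here is a minimal Python sketch that compares a feature's distribution at training time with its distribution in production using the Population Stability Index. It is an illustration only and not how Evidently or Aporia build their reports; the function name, the bin count, and the sample data are assumptions made for this example.</p>

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float], current: Sequence[float], bins: int = 10
) -> float:
    """Compare two samples of one feature; larger values mean more drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant data

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for value in values:
            # Clamp values outside the reference range into the edge bins.
            index = min(max(int((value - lo) / width), 0), bins - 1)
            counts[index] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(count / len(values), 1e-6) for count in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum(
        (cur - ref) * math.log(cur / ref)
        for ref, cur in zip(ref_shares, cur_shares)
    )


# Hypothetical data: training-time values vs. shifted production values.
train_values = [float(i % 50) for i in range(1000)]
prod_values = [float(i % 50) + 20.0 for i in range(1000)]

same_psi = population_stability_index(train_values, train_values)
shifted_psi = population_stability_index(train_values, prod_values)
print(f"identical data: {same_psi:.3f}, shifted data: {shifted_psi:.3f}")
```

<p>Identical samples give a score of zero, and the score grows as the production sample moves away from the training sample. What counts as "too much" drift depends on your model and data, so any cutoff you choose is a project-specific decision.</p>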
<p>Since we're working with Evidently.ai, you can see a target drift report when you
run the notebook for the initial project they made. Here's what it looks like.</p>
<p><img src="https://thumb.tildacdn.com/tild6336-3231-4736-b136-646539326135/-/format/webp/4_week3_pred_actual.png" alt="image of the report showing the target drift"></p>
<p>So we see at the end of Week 3 the model is in pretty bad shape. This is where
we can bring in DVC to help us get this stale model off of production faster.</p>
<h2 id="running-a-training-experiment-to-get-production-up-to-date" style="position:relative;">Running a training experiment to get production up to date<a href="#running-a-training-experiment-to-get-production-up-to-date" aria-label="running a training experiment to get production up to date permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We'll start by taking a year's worth of data and creating a new model. This
might give us a more accurate model to push to production than using weekly
data. So we'll take all the data from 2011 (because that's the dataset we have
to work with) and make our training and testing datasets. Then we'll check this
data into DVC, so it can version it with the following commands:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc add</span> data/train.pkl data/test.pkl
</span><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> data/.gitignore data/train.pkl.dvc data/test.pkl.dvc</span></code></pre></div>
<p>We add the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files to Git to ensure that we are only checking in the
metadata for the datasets and not the entire dataset files. Now we can run the
entire MLOps pipeline with
<a href="https://dvc.org/doc/command-reference/exp/run" target="_blank" rel="nofollow noopener noreferrer">this command</a>:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span></span></code></pre></div>
<p>This will execute the commands we've defined in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> and it will give us
the metrics to evaluate how good the model is. Let's take a look at the metrics
so far with the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-timestamp</span></span></code></pre></div>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"><span class="token rows">┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> ┃ <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span> ┃ <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.seed<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span></span> ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ workspace │ 0.70164 │ 0.51384 │ 20210428 │ 450 │ 64 │
│ main │ 0.60791 │ 0.45758 │ 20210428 │ 375 │ 64 │
│ └── 801fdff [exp-a80c0] │ 0.70164 │ 0.51384 │ 20210428 │ 450 │ 64 │
</span>└─────────────────────────┴──────────┴─────────┴────────────┴─────────────┴─────────────────┘</code></pre></div>
<p>This model doesn't have the best metrics, so we can run more experiments to see
if tuning hyperparameters will help before we deploy this model to production.
Let's change the values of the <code>train.n_est</code> and <code>train.min_split</code> hyperparameters.
We'll
<a href="https://dvc.org/doc/user-guide/experiment-management" target="_blank" rel="nofollow noopener noreferrer">run several experiments</a>
with different values and it will produce a table similar to this:</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"><span class="token rows">┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> ┃ <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span> ┃ <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.seed<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span></span> ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ workspace │ 0.43501 │ 0.79082 │ 20210428 │ 475 │ 28 │
│ main │ 0.60791 │ 0.45758 │ 20210428 │ 375 │ 64 │
│ ├── 78d29aa [exp-f06bb] │ 0.43501 │ 0.79082 │ 20210428 │ 475 │ 28 │
│ ├── 8fb41cf [exp-1323d] │ 0.42796 │ 0.80841 │ 20210428 │ 425 │ 28 │
│ ├── 434a82f [exp-63459] │ 0.36044 │ 0.87037 │ 20210428 │ 350 │ 28 │
│ ├── 549586e [exp-ceb6d] │ 0.61998 │ 0.4306 │ 20210428 │ 350 │ 64 │
│ ├── fbf8760 [exp-affe2] │ 0.68824 │ 0.50067 │ 20210428 │ 425 │ 64 │
│ ├── 732ab92 [exp-f8e8d] │ 0.65138 │ 0.49431 │ 20210428 │ 500 │ 64 │
│ └── 801fdff [exp-a80c0] │ 0.70164 │ 0.51384 │ 20210428 │ 450 │ 64 │
</span>└─────────────────────────┴──────────┴─────────┴────────────┴─────────────┴─────────────────┘</code></pre></div>
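<p>The experiments in the table above vary <code>train.n_est</code> and <code>train.min_split</code>. One possible way to enumerate such combinations is sketched below in Python: it only builds the command strings and does not call DVC. The <code>--queue</code> and <code>-S</code> (<code>--set-param</code>) options belong to <code>dvc exp run</code>, but the exact values and the assumption that both parameters live in the project's params file are ours, so check them against your own setup and DVC version.</p>

```python
from itertools import product

# Example values taken from the experiment table above.
n_est_values = [350, 425, 475]
min_split_values = [28, 64]


def build_commands(n_ests, min_splits):
    """Return one queued `dvc exp run` command per parameter combination."""
    return [
        f"dvc exp run --queue -S train.n_est={n_est} -S train.min_split={min_split}"
        for n_est, min_split in product(n_ests, min_splits)
    ]


commands = build_commands(n_est_values, min_split_values)
for command in commands:
    print(command)
```

<p>Printing the commands first makes it easy to review the grid before running anything; you could then paste them into a shell or run them from a script.</p>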
<p>We've run a few experiments with a different hyperparameter value each time and
it looks like <code>exp-63459</code> is the best one out of them based on its ROC-AUC
value, which is the highest in the table, although its average precision is the lowest. So we'll apply this experiment to our workspace
and choose this model as the one that will go to production. To apply the
experiment, we'll run the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp apply</span> exp-63459</span></code></pre></div>
<p>This will update the workspace with the exact code, data, and hyperparameters
that were used in that particular experiment. So we can commit these changes to
Git and we'll have a reference to everything we need for this exact model. Now
let's say we have deployed this to production and it's been a great model for
almost another year, then we start noticing data drift again.</p>
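<p>Picking the experiment to apply can also be done programmatically. The following Python sketch ranks the rows of the second experiment table by a chosen metric. The numbers are copied from that table; the helper name and the tuple layout are assumptions for illustration, and which metric should drive the decision is a judgment call for your project.</p>

```python
# Rows copied from the second experiment table above:
# (experiment name, avg_prec, roc_auc).
experiments = [
    ("exp-f06bb", 0.43501, 0.79082),
    ("exp-1323d", 0.42796, 0.80841),
    ("exp-63459", 0.36044, 0.87037),
    ("exp-ceb6d", 0.61998, 0.4306),
    ("exp-affe2", 0.68824, 0.50067),
    ("exp-f8e8d", 0.65138, 0.49431),
    ("exp-a80c0", 0.70164, 0.51384),
]

AVG_PREC, ROC_AUC = 1, 2  # column positions in each row


def best_by(rows, metric_index):
    """Return the name of the experiment with the highest value of a metric."""
    return max(rows, key=lambda row: row[metric_index])[0]


best_roc_auc = best_by(experiments, ROC_AUC)
best_avg_prec = best_by(experiments, AVG_PREC)
print("best ROC-AUC:", best_roc_auc)
print("best average precision:", best_avg_prec)
```

<p>With the values from the table, ROC-AUC selects <code>exp-63459</code> while average precision selects <code>exp-a80c0</code>, so it helps to agree on the deciding metric before applying an experiment.</p>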
<h2 id="running-more-training-experiments-with-new-data" style="position:relative;">Running more training experiments with new data<a href="#running-more-training-experiments-with-new-data" aria-label="running more training experiments with new data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>That means it's time to update our dataset with the latest data from production
and that will include all the data on bike sharing in 2012 (because this is the
newer data we have to train with). DVC will note the changes in the data and
create a new version record for the updated data automatically.</p>
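<p>DVC can tell that a dataset changed because the <code>.dvc</code> file stores a content hash of the tracked file. The Python sketch below is a simplified illustration of that idea, not DVC's actual implementation: it hashes a stand-in file before and after new rows are appended. The file name and its contents are hypothetical.</p>

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "train.pkl"  # stand-in file, not a real pickle
    dataset.write_bytes(b"2011 rows")
    hash_2011 = file_md5(dataset)

    # Simulate adding the newer data to the same file.
    dataset.write_bytes(b"2011 rows" + b"2012 rows")
    hash_2012 = file_md5(dataset)

data_changed = hash_2011 != hash_2012
print("dataset changed:", data_changed)
```

<p>Because only the small hash record goes into Git, each commit can point at a specific version of the data without storing the data itself in the repository.</p>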
<p>Next we'll run a new experiment in the project with the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span></span></code></pre></div>
<p>Then we can take a look at the metrics with the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span></span></code></pre></div>
<p>Since we cleared our workspace by pushing the changes to Git, we'll have a fresh
table to look at. Now you should see a table similar to this:</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"><span class="token rows">┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> ┃ <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span> ┃ <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.seed<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span></span> ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ workspace │ 0.42526 │ 0.82722 │ 20210428 │ 400 │ 28 │
│ main │ 0.69744 │ 0.63056 │ 20210428 │ 475 │ 32 │
│ ├── e76a89d [exp-7d207] │ 0.42526 │ 0.82722 │ 20210428 │ 400 │ 28 │
│ ├── 2a6d647 [exp-7526d] │ 0.74411 │ 0.65808 │ 20210428 │ 400 │ 32 │
│ ├── 467fd3d [exp-dfabd] │ 0.71431 │ 0.6267 │ 20210428 │ 450 │ 32 │
│ ├── 2a2171c [exp-45493] │ 0.58291 │ 0.49201 │ 20210428 │ 350 │ 48 │
│ └── 683dc49 [exp-2649a] │ 0.58421 │ 0.5783 │ 20210428 │ 475 │ 48 │
</span>└─────────────────────────┴──────────┴─────────┴────────────┴─────────────┴─────────────────┘</code></pre></div>
<p>Having the updated dataset made a huge difference in the metrics, and it looks
like this model has a different set of hyperparameters that perform well. Now
that we have all of the experiments with both the old and new datasets, this is
a good time to share the results with your coworkers and get some feedback.</p>
<h2 id="viewing-experiment-results-in-dvc-studio" style="position:relative;">Viewing experiment results in DVC Studio<a href="#viewing-experiment-results-in-dvc-studio" aria-label="viewing experiment results in dvc studio permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Because we already have DVC set up in this project, we can run as many
experiments as we need to, and it will track which datasets we're working with,
the code changes that we make, and it'll let us look at all the results from
each experiment in Studio.</p>
<p>If you go to <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a>, you'll be
prompted to connect to your GitHub/GitLab account and you'll be able to choose
the repo for this project. Once you're connected, you should be able to see all
the experiments you've pushed to your Git history.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/928412502573b85d1c9da6ef8b136d4c/39600/stale_models_in_studio.png" alt="example of plots and results in DVC Studio" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>You can give others on your team access to this, and they'll be able to run new
experiments and see the results right in the browser. This is a great tool to
use to discuss the next best steps in your model training before you're ready to
deploy.</p>
<h2 id="deploy-new-model-to-production" style="position:relative;">Deploy new model to production<a href="#deploy-new-model-to-production" aria-label="deploy new model to production permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The output of our training stage is the <code>model.pt</code> file. Now all we need
to do is get this to our production environment. That could be a web API that
returns results in real-time, or you could do some kind of batch prediction.
Regardless of how you deploy to production, you now have a model that's been
updated to account for the previous data drift.</p>
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Now you just have to keep an eye on this new model to make sure that it doesn't
stray too far from the results you expect. This is one of the processes you can
use to keep your production models from going stale. You could even automate
some parts of this process if you know what your thresholds are!</p>https://dvc.org/blog/march-22-community-gemshttps://dvc.org/blog/march-22-community-gemsWed, 30 Mar 2022 00:00:00 GMT<h3 id="what-is-the-difference-between-using-dvc-exp-run-and-dvc-repro" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/939070512322195456" target="_blank" rel="nofollow noopener noreferrer">What is the difference between using <code>dvc exp run</code> and <code>dvc repro</code>?</a><a href="#what-is-the-difference-between-using-dvc-exp-run-and-dvc-repro" aria-label="what is the difference between using dvc exp run and dvc repro permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is a really good question from @v2.03.99!</p>
<p>When you use <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a>, DVC automatically tracks each experiment run. Using
<a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> leaves it to the user to track each experiment.</p>
<p>You can learn how <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> uses custom Git refs to track experiments in
this <a href="https://dvc.org/blog/experiment-refs" target="_blank" rel="nofollow noopener noreferrer">blog post</a> and you can see a quick
technical overview in
<a href="https://dvc.org/doc/user-guide/experiment-management/experiments-overview" target="_blank" rel="nofollow noopener noreferrer">the docs here</a>.</p>
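<p>As a rough sketch of the difference, the two workflows might look like this on
the command line. The parameter name <code>train.lr</code>, the <code>params.yaml</code> file, and the
commit message are hypothetical placeholders, and the sketch assumes a project
that already has a <code>dvc.yaml</code> pipeline:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"># With dvc repro, you record each run yourself through Git
$ dvc repro
$ git add dvc.lock params.yaml
$ git commit -m "Try a new learning rate"

# With dvc exp run, DVC records the run as an experiment for you
$ dvc exp run --set-param train.lr=0.01
$ dvc exp show</code></pre></div>
<p>Here <code>dvc exp show</code> lists the tracked experiments alongside their params and
metrics, so there is no need for a Git commit per run.</p>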
<h3 id="what-is-a-good-way-to-debug-dvc-stages-in-vscode" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/939269709780643861" target="_blank" rel="nofollow noopener noreferrer">What is a good way to debug DVC stages in VSCode?</a><a href="#what-is-a-good-way-to-debug-dvc-stages-in-vscode" aria-label="what is a good way to debug dvc stages in vscode permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>A great question here from @quarkquark!</p>
<p>You can debug in VSCode by following the steps below:</p>
<ul>
<li>Install the <code>debugpy</code> package.</li>
<li>Navigate to <code>"Run and Debug" > "Remote Attach" > localhost > someport</code>.</li>
<li>In a terminal in VSCode, run
<code>python -m debugpy --listen someport --wait-for-client -m dvc mycommand</code>.</li>
</ul>
<p>This should help you debug the stages in your pipeline in the IDE and you can
find
<a href="https://github.com/iterative/dvc/wiki/Debugging-DVC-interactively" target="_blank" rel="nofollow noopener noreferrer">more details here</a>.</p>
<h3 id="is-there-a-way-to-list-what-files-and-ideally-additional-info-like-location-md5-etc-are-within-a-directory-tracked-by-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/940318136568258650" target="_blank" rel="nofollow noopener noreferrer">Is there a way to list what files (and ideally additional info like location, MD5, etc) are within a directory tracked by DVC?</a><a href="#is-there-a-way-to-list-what-files-and-ideally-additional-info-like-location-md5-etc-are-within-a-directory-tracked-by-dvc" aria-label="is there a way to list what files and ideally additional info like location md5 etc are within a directory tracked by dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for asking @CarsonM!</p>
<p>You should be able to use DVC to list the contents of a DVC repository,
including its DVC-tracked directories, without cloning or pulling the repo.
Here's an example of the command you can run:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc list</span> https://github.com/iterative/dataset-registry/ fashion-mnist/raw</span></code></pre></div>
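<p>If you need more than the top level of a directory, a variation like the one
below may help. It is only a sketch: it reuses the example repository from the
command above, and the output depends on what that repository contains when you
run it. The <code>-R</code> flag lists contents recursively and <code>--dvc-only</code> hides files
that are tracked by Git rather than DVC:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"># Recursively list only the DVC-tracked files under fashion-mnist
$ dvc list -R --dvc-only https://github.com/iterative/dataset-registry fashion-mnist</code></pre></div>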
<h3 id="if-we-have-multiple-datasets-is-it-recommended-to-have-1-remote-per-dataset-or-to-have-1-remote-and-let-dvc-handle-the-paths" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/943213340195434546" target="_blank" rel="nofollow noopener noreferrer">If we have multiple datasets, is it recommended to have 1 remote per dataset or to have 1 remote and let DVC handle the paths?</a><a href="#if-we-have-multiple-datasets-is-it-recommended-to-have-1-remote-per-dataset-or-to-have-1-remote-and-let-dvc-handle-the-paths" aria-label="if we have multiple datasets is it recommended to have 1 remote per dataset or to have 1 remote and let dvc handle the paths permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is a really interesting question from @BrownZ!</p>
<p>It really depends on your use case. Separated remotes might be useful if you
want to have granular control over permissions for each dataset.</p>
<p>In general, we would suggest a single remote and setting up a
<a href="https://dvc.org/doc/use-cases/data-registries" target="_blank" rel="nofollow noopener noreferrer">data registry</a> to handle the
different datasets through DVC.</p>
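<p>As a minimal sketch of the registry approach, a project could bring a dataset
in from a registry repository and refresh it later. The registry URL and path
are borrowed from the <code>dvc list</code> example earlier in this post purely for
illustration, so substitute your own registry:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"># Import the dataset; DVC records its source in raw.dvc
$ dvc import https://github.com/iterative/dataset-registry fashion-mnist/raw

# Later, bring in the newest version published in the registry
$ dvc update raw.dvc</code></pre></div>
<p>Each dataset lives at its own path in the registry, so projects only import
the ones they need while all the data shares one remote.</p>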
<h3 id="is-there-a-mailing-list-for-subscribing-to-cml-releases" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/939215540591927337" target="_blank" rel="nofollow noopener noreferrer">Is there a mailing list for subscribing to CML releases?</a><a href="#is-there-a-mailing-list-for-subscribing-to-cml-releases" aria-label="is there a mailing list for subscribing to cml releases permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>It's awesome that community members like @pria want to keep up with our releases!</p>
<p>You can follow all of our releases via GitHub notifications. You can browse
release notes at <a href="https://github.com/iterative/cml/releases" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/cml/releases</a>. You can also
subscribe to release updates by clicking the <code>Watch</code> button in the top-right,
navigating to <code>Custom</code>, and checking the <code>Releases</code> option.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 166px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/eb18c4360f0c57b120be336596dc0a9d/ca0b1/cml-release-follow.png" alt="the checkbox you need to check in GitHub to follow CML releases" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h3 id="does-cml-send-comment-work-for-azure-devops-repositories" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/947986936994353293" target="_blank" rel="nofollow noopener noreferrer">Does <code>cml-send-comment</code> work for azure devops repositories?</a><a href="#does-cml-send-comment-work-for-azure-devops-repositories" aria-label="does cml send comment work for azure devops repositories permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for the question @1cybersheep1!</p>
<p>Currently, the supported Source Code Management tools are GitHub, GitLab, and
Bitbucket. Other SCMs may be a part of the roadmap later on.</p>
<h3 id="if-my-model-is-training-on-a-self-hosted-local-runner-and-i-already-have-a-shared-dvc-cache-set-up-on-the-same-machine-is-there-a-good-way-for-my-github-workflow-to-access-that-cache-instead-of-having-to-redownload-it-all-from-cloud-storage" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/951240652035883008" target="_blank" rel="nofollow noopener noreferrer">If my model is training on a self-hosted, local runner, and I already have a shared DVC cache set up on the same machine, is there a good way for my GitHub workflow to access that cache instead of having to redownload it all from cloud storage?</a><a href="#if-my-model-is-training-on-a-self-hosted-local-runner-and-i-already-have-a-shared-dvc-cache-set-up-on-the-same-machine-is-there-a-good-way-for-my-github-workflow-to-access-that-cache-instead-of-having-to-redownload-it-all-from-cloud-storage" aria-label="if my model is training on a self hosted local runner and i already have a shared dvc cache set up on the same machine is there a good way for my github workflow to access that cache instead of having to redownload it all from cloud storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Excellent question from @luke_imm!</p>
<p>In GitHub Actions, you can mount volumes to your container, but you have to
declare them within the
<a href="https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-running-a-job-within-a-container" target="_blank" rel="nofollow noopener noreferrer">workflow YAML</a>.</p>
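<p>A minimal sketch of what that declaration might look like is below. The image
name, the host path, and the mount point are all hypothetical, and the job
assumes the image already has DVC installed:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml">jobs:
  train:
    runs-on: self-hosted
    container:
      # Hypothetical image that ships with DVC
      image: example/training-image:latest
      volumes:
        # host path of the shared cache : path inside the container
        - /mnt/shared/dvc-cache:/dvc-cache
    steps:
      - uses: actions/checkout@v4
      - name: Use the mounted cache, then restore data
        run: |
          dvc cache dir --local /dvc-cache
          dvc pull</code></pre></div>
<p>With the cache directory mounted and configured this way, <code>dvc pull</code> should
only need to download objects that are missing from the shared cache.</p>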
<hr>
<p><img src="https://media.giphy.com/media/3o6Mbnll2gudglC3HG/giphy.gif" alt="Season 3 Race GIF"></p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and
CML questions answered!</p>https://dvc.org/blog/march-22-heartbeathttps://dvc.org/blog/march-22-heartbeatThu, 17 Mar 2022 00:00:00 GMT<h1 id="on-the-war-in-ukraine-" style="position:relative;">On the war in Ukraine 🇺🇦<a href="#on-the-war-in-ukraine-" aria-label="on the war in ukraine permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>While the war in Ukraine has impacted the world, it has also greatly impacted
our company as we have team members living in Ukraine and Russia, and many with
family ties to both. Our hearts are with our Iterative family in Ukraine and we
are committed to doing everything we can to support the safety of our Ukrainian
colleagues, as well as the transition of our Russian colleagues, during this crisis.</p>
<p>We as a company are against this war. We have donated to the humanitarian
efforts to help the people of Ukraine and are matching our team members'
donations as well. We are proud of the perseverance, care, and support coming
from our team at this time.</p>
<p>If you are able, we ask that you consider these resources as ways to help. Our
hope is that the world will find a quick and peaceful end to this war and
Ukraine will be restored, even stronger than before. 🇺🇦</p>
<h2 id="donations" style="position:relative;">🪙 Donations<a href="#donations" aria-label="donations permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ul>
<li><a href="https://www.reddit.com/r/ukraine/comments/s6g5un/want_to_support_ukraine_heres_a_list_of_charities/" target="_blank" rel="nofollow noopener noreferrer">A list of charities with direct connections to Ukrainian people endorsed</a>
by the <a href="https://kyivindependent.com/" target="_blank" rel="nofollow noopener noreferrer">Kyiv Independent</a>. Everything on this
list except for the “Charities that help the war effort” section is for
humanitarian efforts only.</li>
<li><a href="https://bank.gov.ua/en/news/all/natsionalniy-bank-vidkriv-rahunok-dlya-gumanitarnoyi-dopomogi-ukrayintsyam-postrajdalim-vid-rosiyskoyi-agresiyi" target="_blank" rel="nofollow noopener noreferrer">Humanitarian Assistance to Ukrainians by National Bank of Ukraine</a></li>
<li><a href="https://www.unicefusa.org/?form=ukraine-emergency-match" target="_blank" rel="nofollow noopener noreferrer">UNICEF USA</a> (2x
additional match)</li>
<li><a href="https://www.unicef.org.uk/donate/donate-now-to-protect-children-in-ukraine/" target="_blank" rel="nofollow noopener noreferrer">UNICEF</a>
UK</li>
<li><a href="https://donate.unrefugees.org.uk/general/~my-donation?_cv=1" target="_blank" rel="nofollow noopener noreferrer">UNHCR</a></li>
<li><a href="https://donate.redcrossredcrescent.org/ua/donate/~my-donation?_cv=1" target="_blank" rel="nofollow noopener noreferrer">RedCross Ukraine</a>
(there are some concerns about this org - see
<a href="https://twitter.com/ptico/status/1502192685364531204" target="_blank" rel="nofollow noopener noreferrer">one</a>,
<a href="https://twitter.com/KyivIndependent/status/1501136976447168512" target="_blank" rel="nofollow noopener noreferrer">two</a>)</li>
<li><a href="https://donate.redcross.org.uk/appeal/ukraine-crisis-appeal" target="_blank" rel="nofollow noopener noreferrer">RedCross UK</a></li>
<li><a href="https://give.internationalmedicalcorps.org/page/99837/donate/1" target="_blank" rel="nofollow noopener noreferrer">International Medical Corps</a></li>
<li><a href="https://www.wfp.org/support-us/stories/ukraine-appeal" target="_blank" rel="nofollow noopener noreferrer">WFP</a></li>
<li><a href="https://www.ukrainecharity.org/war-crisis-692518.html" target="_blank" rel="nofollow noopener noreferrer">UKRAINECHARITY</a></li>
<li><a href="https://novaukraine.org/" target="_blank" rel="nofollow noopener noreferrer">NOVA UKRAINE</a></li>
<li><a href="https://www.gofundme.com/f/support-ukrainian-refugees-arriving-in-poland" target="_blank" rel="nofollow noopener noreferrer">GOFUNDME / Support Ukrainian Refugees Arriving In Poland</a></li>
<li><a href="https://www.doctorswithoutborders.org/what-we-do/countries/ukraine" target="_blank" rel="nofollow noopener noreferrer">Doctors Without Borders</a></li>
<li><a href="https://support.savethechildren.org/site/Donation2?df_id=5751&mfc_pref=T&5751.donation=form1" target="_blank" rel="nofollow noopener noreferrer">Save the Children</a></li>
<li><a href="https://www.icrc.org/en/donate/ukraine" target="_blank" rel="nofollow noopener noreferrer">ICRC</a></li>
<li><a href="https://secure.projecthope.org/site/SPageNavigator/2022_02_Ukraine_Response_Web_UNR.html&s_subsrc=oth" target="_blank" rel="nofollow noopener noreferrer">Project Hope</a></li>
<li><a href="https://www.flexport.org/donate-now" target="_blank" rel="nofollow noopener noreferrer">Flexport</a></li>
</ul>
<h2 id="️other-ways-to-help" style="position:relative;">❤️🩹 Other ways to help<a href="#%EF%B8%8Fother-ways-to-help" aria-label="️other ways to help permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ul>
<li><a href="https://icanhelp.host/" target="_blank" rel="nofollow noopener noreferrer">I Can Help (hosting)</a></li>
<li><a href="https://www.airbnb.org/help-ukraine" target="_blank" rel="nofollow noopener noreferrer">Airbnb - host a refugee</a></li>
</ul>
<hr>
<h1 id="aiml-news" style="position:relative;">AI/ML News<a href="#aiml-news" aria-label="aiml news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><img src="https://media.giphy.com/media/5YiRHZtcSeiEyOpSV7/giphy.gif" alt="Excited Marie Kondo GIF"></p>
<h2 id="mihail-eric-mlops-is-a-mess-but-thats-to-be-expected" style="position:relative;">Mihail Eric: MLOps is a Mess But That's to be Expected<a href="#mihail-eric-mlops-is-a-mess-but-thats-to-be-expected" aria-label="mihail eric mlops is a mess but thats to be expected permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://twitter.com/mihail_eric" target="_blank" rel="nofollow noopener noreferrer"><strong>Mihail Eric</strong></a> writes a long, but <em>really
worth it</em> piece entitled
<a href="https://www.mihaileric.com/posts/mlops-is-a-mess/" target="_blank" rel="nofollow noopener noreferrer">MLOps is a Mess But That’s to be Expected.</a>
In it he discusses the allure of seeking a machine learning career, only to run
smack into the giant wall of learning that encompasses the space, not the least
of which is the multitude of tools to pick through once you get there. The state
of machine learning is reviewed and some history of DevOps for perspective on
MLOps is added.<br>
You will find advice for newcomers and some final, thorough, thoughts and
predictions especially as they relate to “ML at a reasonable scale” companies.<br>
Definitely worth your review!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d7d7813635625b97f263ea955bb7f77c/39600/hype-cycle-mihail-eric.png" alt="Gartner Hype cycle for MLOps" title="Gartner Hype cycle for MLOps" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Gartner Hype cycle for MLOps
(<a href="https://www.mihaileric.com/posts/mlops-is-a-mess/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h1 id="community-news" style="position:relative;">Community News<a href="#community-news" aria-label="community news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="kevin-lu-learn-how-to-use-data-version-control-to-remove-the-third-wheel-from-your-relationship" style="position:relative;">Kevin Lu: Learn how to use Data Version Control to remove the third wheel from your relationship<a href="#kevin-lu-learn-how-to-use-data-version-control-to-remove-the-third-wheel-from-your-relationship" aria-label="kevin lu learn how to use data version control to remove the third wheel from your relationship permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><span class="gatsby-resp-image-wrapper image-wrap-right" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/8394d1b2ac723900cba241f2483b5cf5/ab158/kevin.png" alt="Learn how to use Data Version Control to remove the third wheel from your relationship" title="Learn how to use Data Version Control to remove the third wheel from your relationship" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
In
<a href="https://medium.com/@kevinylu/learn-how-to-use-data-version-control-to-remove-the-third-wheel-from-your-relationship-ce4c2afa649c" target="_blank" rel="nofollow noopener noreferrer">this hilarious post,</a>
<a href="https://medium.com/@kevinylu" target="_blank" rel="nofollow noopener noreferrer"><strong>Kevin Lu</strong></a> teaches us how to use DVC to enable
us to disconnect from our unhealthy addictive relationships with our computers
and make room for more human relationships! You don't want to miss the humor,
productivity, and wisdom here, and along the way you'll learn how each of DVC's
commands helps your machine learning engineering exploits.</p>
<h2 id="thanakorn-panyapiang-putting-a-machine-learning-model-into-production-with-google-cloud-platform-and-dvc" style="position:relative;">Thanakorn Panyapiang: Putting A Machine Learning model into production with Google Cloud Platform and DVC<a href="#thanakorn-panyapiang-putting-a-machine-learning-model-into-production-with-google-cloud-platform-and-dvc" aria-label="thanakorn panyapiang putting a machine learning model into production with google cloud platform and dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Are you a data scientist new to putting models into production?<br>
<a href="https://towardsdatascience.com/putting-machine-learning-model-into-production-with-google-cloud-platform-and-dvc-f6a22cdcf4a5" target="_blank" rel="nofollow noopener noreferrer">In this piece</a> <a href="https://www.linkedin.com/in/tpanyapiang/" target="_blank" rel="nofollow noopener noreferrer"><strong>Thanakorn Panyapiang</strong></a>
describes various model deployment strategies to put projects into production
including model-as-service, batch prediction and model-on-edge. In his example
he uses a batch prediction approach with an image segmentation model to identify
clouds. He uses DVC as a model registry with Google Cloud Storage and GitHub
Actions to automate the Cloud Functions deployment. See all the steps he
outlines in his piece to get real value out of your machine learning projects.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/8fb6217eabb6588ca79eeb4ebe471cd1/03346/panyapiang.jpg" alt="Data Pipeline" title="Data Pipeline" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Data
Pipeline
(<a href="https://towardsdatascience.com/putting-machine-learning-model-into-production-with-google-cloud-platform-and-dvc-f6a22cdcf4a5" target="_blank" rel="nofollow noopener noreferrer">Source link: Author</a>)</em></p>
<h2 id="matthew-upson-mlops-for-conversational-ai-with-rasa-dvc-and-cml-partii" style="position:relative;">Matthew Upson: MLOps for Conversational AI with Rasa, DVC, and CML (PartII)<a href="#matthew-upson-mlops-for-conversational-ai-with-rasa-dvc-and-cml-partii" aria-label="matthew upson mlops for conversational ai with rasa dvc and cml partii permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In the <a href="https://dvc.org/blog/december-21-heartbeat" target="_blank" rel="nofollow noopener noreferrer">December Heartbeat,</a> I told
you about <a href="https://twitter.com/m_a_upson" target="_blank" rel="nofollow noopener noreferrer"><strong>Matt Upson's</strong></a> first post in his
series on using DVC, CML and Rasa together.
<a href="https://medium.com/mantisnlp/mlops-for-conversational-ai-with-rasa-dvc-and-cml-part-ii-3a70fe2f357d" target="_blank" rel="nofollow noopener noreferrer">In this second post</a>
he goes through some Rasa basics and gets the DVC pipeline set up, with its train
and test stages, params, dependencies, outs and metrics. He also covers syncing
with DVC, making changes, the <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> command, the <code>dvc.lock</code> file, and
pushing to remote storage. We're looking forward to the next installment when we
will see how CML can be used to automatically train the model.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d5f1cb82c955a6ecc1ddb237ed888689/39600/upson.png" alt="Rasa DVC metrics diff" title="Rasa DVC metrics diff" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DVC
metrics diff in Rasa project
(<a href="https://medium.com/mantisnlp/mlops-for-conversational-ai-with-rasa-dvc-and-cml-part-ii-3a70fe2f357d" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="sibanjan-das-mlops-for-enterprise-ai" style="position:relative;">Sibanjan Das: MLOps for Enterprise AI<a href="#sibanjan-das-mlops-for-enterprise-ai" aria-label="sibanjan das mlops for enterprise ai permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/sibanjan/" target="_blank" rel="nofollow noopener noreferrer"><strong>Sibanjan Das</strong></a> notes the trending of
the MLOps keyword in
<a href="https://dzone.com/articles/mlops-for-enterprise-ai" target="_blank" rel="nofollow noopener noreferrer">his piece</a> in
<a href="https://dzone.com" target="_blank" rel="nofollow noopener noreferrer">DZone.</a> Sibanjan gives an overview of MLOps and how it
supports the AI/ML ecosystem to deliver return on investment for ML projects. He
reviews the components of MLOps, including automated ML model building
pipelines, model serving, model version control, model/data monitoring, and
security and governance. He also discusses the MLOps maturity models of Google
and Microsoft (see below). I found this part especially interesting as it
mirrors what we see in our Community and how they develop using our tools as
well. Finally, he outlines some tools that help in the process, including DVC.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/69402d709c22ad8d698a6d67b39a2bf4/03346/das.jpg" alt="Comparing Google's and Microsoft's maturity models" title="Comparing Google's and Microsoft's maturity models" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Comparing Google's and Microsoft's maturity models
(<a href="https://dzone.com/articles/mlops-for-enterprise-ai" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="jagreet-kaur-implementing-devops-for-machine-learning---a-quick-guide" style="position:relative;">Jagreet Kaur: Implementing DevOps for Machine Learning - A Quick Guide<a href="#jagreet-kaur-implementing-devops-for-machine-learning---a-quick-guide" aria-label="jagreet kaur implementing devops for machine learning a quick guide permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/cee3733976e3c8af79f13a22284e6f55/39600/jagreet-kaur.png" alt="Tensorflow, PyTorch, DVC, Docker, CI/CD" title="Continuous Development Life Cycle Guide from Xenonstack" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<a href="https://www.linkedin.com/in/jagreetkaur/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jagreet Kaur</strong></a> of
<a href="https://www.xenonstack.com/" target="_blank" rel="nofollow noopener noreferrer">Xenonstack</a> authors
<a href="https://www.xenonstack.com/blog/devops-for-machine-learning" target="_blank" rel="nofollow noopener noreferrer">a guide</a> on
applying DevOps to machine learning and, more generally, on what the continuous
development life cycle means for machine learning projects. Jagreet
goes over all the fun continuous topics, including continuous integration,
continuous testing, continuous retraining, and continuous deployment. She gives
an overview of the use of TensorFlow, PyTorch, and Docker, as well as DVC for
version control, experiment management, deployment, and collaboration. Additional
resources from Xenonstack are provided for further review.</p>
<h3 id="yuqi-li-why-mlops-should-be-open-source" style="position:relative;">Yuqi Li: Why MLOps should be Open Source<a href="#yuqi-li-why-mlops-should-be-open-source" aria-label="yuqi li why mlops should be open source permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><span class="gatsby-resp-image-wrapper image-wrap-right" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/270790f920a14247fcc2e0ea0e2f80e6/03346/yuqi-li.jpg" alt="Why MLOps Tools should be Open Source" title="Why MLOps Tools should be Open Source" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<a href="https://www.linkedin.com/in/yuqiliofficial/" target="_blank" rel="nofollow noopener noreferrer"><strong>Yuqi Li</strong></a>,
<a href="https://towardsdatascience.com/why-mlops-tools-should-be-open-source-5ad696463f54" target="_blank" rel="nofollow noopener noreferrer">in this opinion piece</a>
in <a href="https://towardsdatascience.com/" target="_blank" rel="nofollow noopener noreferrer">Towards Data Science</a>, gives an overview of the
meaning and components of MLOps and identifies a number of good open-source
tools in the space, which of course includes DVC. He also outlines a number of
reasons why MLOps should be open source. Among the reasons making the cut:</p>
<ol>
<li>Cost-Effectiveness</li>
<li>Ownership</li>
<li>No privacy concern</li>
<li>Build Community around the tool</li>
</ol>
<p>Examine these reasons to determine if open source makes sense for your MLOps
work. We think it will.</p>
<h2 id="and-speaking-of-community" style="position:relative;">And speaking of Community…<a href="#and-speaking-of-community" aria-label="and speaking of community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h2 id="mert-bozkir-community-driven-learning" style="position:relative;">Mert Bozkir: Community-Driven Learning<a href="#mert-bozkir-community-driven-learning" aria-label="mert bozkir community driven learning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>If you’ve been in our Discord server, been to one of our Meetups, or interacted
with us on Twitter, you’ve surely come across DVC Community All-Star
<a href="https://github.com/mertbozkir" target="_blank" rel="nofollow noopener noreferrer"><strong>Mert Bozkir</strong></a>. Mert has written
<a href="https://medium.com/@mertbozkir/community-driven-learning-2481103aa190" target="_blank" rel="nofollow noopener noreferrer">a great piece</a>
entitled <em>Community Driven Learning</em> that describes why it is the best way to
learn. He outlines his reasoning for this, including the support, encouragement,
and motivation you can get from the Community to be persistent in your learning
efforts. He also includes eight communities that are great for learning, with
invites included. Be sure to check it out!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/6ec351cad71685ffed5d067d74c5ac38/03346/community.jpg" alt="Community Driven Learning" title="Community Driven Learning" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Community Driven Learning
(<a href="https://unsplash.com/@john_cameron" target="_blank" rel="nofollow noopener noreferrer">Source link: Unsplash by john_cameron</a>)</em></p>
<h2 id="and-speaking-of-learning" style="position:relative;">And speaking of learning…<a href="#and-speaking-of-learning" aria-label="and speaking of learning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h1 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><img src="https://media.giphy.com/media/3ohuAxV0DfcLTxVh6w/giphy.gif" alt="GIF by Star Wars"></p>
<h2 id="online-courses-updates" style="position:relative;">Online Course(s) Updates<a href="#online-courses-updates" aria-label="online courses updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ul>
<li>
<p>We now have over <strong>250</strong> students taking the course and <strong>10</strong> students who
have completed the course! 🎉 Thank you to all who have given us feedback. We
are actively working on making adjustments to the course and improving the
next one.</p>
</li>
<li>
<p>We have a new look! The website for our online course, Iterative Tools for
Data Scientists and Analysts, has been streamlined so that students can more
easily find what they need in the course!</p>
</li>
<li>
<p>We have already begun working on the second course which will be more advanced
(remember those maturity models outlined in the article from DZone above?) and
will cover scenarios with CML. We are also working on creating an ebook for
each video that will provide relevant information, diagrams, and links with
the video content instead of being batched at the end of the module. The ebook
format will also let you take your own notes as you study!</p>
</li>
</ul>
<h2 id="new-hires" style="position:relative;">New Hires<a href="#new-hires" aria-label="new hires permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><img src="https://media.giphy.com/media/lQ0LC603dA96Gs2Hfx/giphy.gif" alt="My Team GIF by The Voice"></p>
<p><a href="https://www.linkedin.com/in/michael-moynihan/" target="_blank" rel="nofollow noopener noreferrer"><strong>Mike Moynihan</strong></a> joins us from
Brooklyn, NY as an Account Executive. He previously worked at Code Climate as
the Manager of Business Development and an Account Executive. Mike's really into
biking and will be participating in the 5-Boro Bike Tour in NYC this year. He's
also a baker and has been baking bread and other baked goods consistently for
about 3 years now. Finally, when not working or biking or baking, you may find
him playing one of the video or board games in his 500-strong collection.</p>
<p><a href="https://www.linkedin.com/in/rcdewit/" target="_blank" rel="nofollow noopener noreferrer"><strong>Rob De Wit</strong></a> joins our team from
Utrecht, the Netherlands as a Developer Advocate. Rob's first focus will be on
developing those new ebooks for our new online courses mentioned above. He has a
background in Information Sciences and previously worked at bol.com and
Devoteam. When not working, Rob likes photo and video editing, board games,
organizing meetups, and hiking (the Peaks of the Balkans are on his bucket
list).<br>
He also stays busy by learning Spanish and dabbling in local politics.</p>
<h2 id="upcoming-events" style="position:relative;">Upcoming Events<a href="#upcoming-events" aria-label="upcoming events permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="march-office-hours" style="position:relative;">March Office Hours!<a href="#march-office-hours" aria-label="march office hours permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Be sure to join us at the
<a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/283998696/" target="_blank" rel="nofollow noopener noreferrer">March Office Hours Meetup,</a>
where <a href="https://github.com/PythonFZ/" target="_blank" rel="nofollow noopener noreferrer"><strong>Fabian Zills</strong>,</a> PhD student at
<a href="https://www.uni-stuttgart.de/en/" target="_blank" rel="nofollow noopener noreferrer">University of Stuttgart,</a> will present his
ZnTrack ("zinc track") project which creates, runs and benchmarks DVC pipelines
in Python and Jupyter Notebooks.<br>
<a href="https://github.com/zincware/ZnTrack" target="_blank" rel="nofollow noopener noreferrer">Find the repo here!</a></p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/283998696/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">March Office Hours - ZnTrack</h4>
<div class="elp-description">RSVP for DVC Office Hours - ZnTrack - Create, Visualize, Run and Benchmark DVC Pipelines in Python & Jupyter Notebooks </div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2022-03-17/office-hours-meetup-dcb241606953b111ec130fa158c4527b.png" alt="March Office Hours - ZnTrack">
</div>
</a>
</section>
<p></p>
<h2 id="conferenceshackathons" style="position:relative;">Conferences/Hackathons<a href="#conferenceshackathons" aria-label="conferenceshackathons permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ul>
<li>We will be sponsoring <a href="https://odsc.com/boston/" target="_blank" rel="nofollow noopener noreferrer">ODSC East</a> and
<a href="https://mlopsworld.com/" target="_blank" rel="nofollow noopener noreferrer">MLOps World</a> this year, so if you are attending,
we'd love to meet you IRL! Stop by our booth!</li>
<li><a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a> will be speaking at
<a href="https://2022.pythonwebconf.com/" target="_blank" rel="nofollow noopener noreferrer">PythonWeb Conference</a> March 22nd on "Using
Reproducible Experiments to Create Better Machine Learning Models."</li>
<li><a href="https://github.com/daavoo" target="_blank" rel="nofollow noopener noreferrer"><strong>David de la Iglesia Castro</strong></a> will be presenting
his workshop "Making MLOps Uncool Again" at
<a href="https://mlopsworld.com/newyork/" target="_blank" rel="nofollow noopener noreferrer">MLOps World New York</a> on March 29th and at
<a href="https://2022.pycon.de/" target="_blank" rel="nofollow noopener noreferrer">PyCon Berlin</a> April 11th.</li>
<li>Community member <a href="https://twitter.com/GiftOjeabulu_" target="_blank" rel="nofollow noopener noreferrer"><strong>Gift Ojeabulu</strong></a> will
be giving a talk on "MLops Exploration with Git and DVC for Machine Learning
Project" at <a href="https://festival.oscafrica.org/" target="_blank" rel="nofollow noopener noreferrer">Open Source Festival 2022</a> March
24-26.</li>
<li><a href="https://www.battery.dev/" target="_blank" rel="nofollow noopener noreferrer">BatteryDev Hackathon</a> will take place next week and
<a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a> will hold an Office
Hours for those needing help with DVC on March 21st.</li>
<li><a href="https://twitter.com/AntoineToubhans" target="_blank" rel="nofollow noopener noreferrer"><strong>Antoine Toubhans</strong></a> will be presenting
his DVC integration with Streamlit at <a href="https://2022.pycon.de/" target="_blank" rel="nofollow noopener noreferrer">PyCon Berlin</a>
as well.</li>
</ul>
<h2 id="-new-docs" style="position:relative;">📖 New Docs<a href="#-new-docs" aria-label=" new docs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="cml-ci" style="position:relative;">CML CI<a href="#cml-ci" aria-label="cml ci permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>CML has a new command reference page for <code>cml ci</code>, the command that prepares
the Git repository for CML operations. For more info on <code>cml ci</code>,
<a href="https://cml.dev/doc/ref/ci#command-reference-ci" target="_blank" rel="nofollow noopener noreferrer">check out the docs</a>.</p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Even with our amazing new additions to the team, we're still hiring!
<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a>
to find details of all the positions and share with anyone you think may be
interested! 🚀</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b575ad2dddeaef4c8a1e475f80cc5ca2/03346/hiring.jpg" alt="Iterative.ai is Hiring" title="Iterative.ai is Hiring" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Iterative is Hiring
(<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We were really excited to see the <a href="https://www.sicara.ai/" target="_blank" rel="nofollow noopener noreferrer">Sicara</a> team all
decked out in their DVC swag this month in this Tweet. If you haven't seen the
video of <a href="https://twitter.com/AntoineToubhans" target="_blank" rel="nofollow noopener noreferrer">Antoine Toubhans</a>'s integration
with Streamlit, you can
<a href="https://www.youtube.com/watch?v=F318uN01v7M&t=2s" target="_blank" rel="nofollow noopener noreferrer">see it on our YouTube channel</a>
or catch the presentation at this year's <a href="https://2022.pycon.de/" target="_blank" rel="nofollow noopener noreferrer">PyCon Berlin.</a></p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Swag received :) Thanks <a href="https://twitter.com/DVCorg">@DVCorg</a> !! We love DVC at @sicara_fr 👉 keep up the great work 👍 <a href="https://twitter.com/_Okamille">@_Okamille</a> <a href="https://twitter.com/e_vignon">@e_vignon</a> <a href="https://twitter.com/cpierrehenri">@cpierrehenri</a> <a href="https://twitter.com/JPro20">@JPro20</a> <a href="https://twitter.com/SoulMathieu">@SoulMathieu</a> <a href="https://twitter.com/Arnault_Chaz">@Arnault_Chaz</a> <a href="https://t.co/RbFuCMG4NS">pic.twitter.com/RbFuCMG4NS</a></p>— Antoine Toubhans (@AntoineToubhans) <a href="https://twitter.com/AntoineToubhans/status/1497254983963660292">February 25, 2022</a></blockquote>
<p>How do you get some DVC swag, you ask? Write us some great content, contribute to
our tools, give a presentation at one of our Meetups! We'd love to have you!</p>
<hr>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/february-22-community-gemshttps://dvc.org/blog/february-22-community-gemsMon, 28 Feb 2022 00:00:00 GMT<h3 id="how-can-i-delete-dvc-tracked-files-from-cloud-storage" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/927618225989111880" target="_blank" rel="nofollow noopener noreferrer">How can I delete DVC-tracked files from cloud storage?</a><a href="#how-can-i-delete-dvc-tracked-files-from-cloud-storage" aria-label="how can i delete dvc tracked files from cloud storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for the question @fireballpoint1!</p>
<p>You can find the best way to delete files from your cloud storage in
<a href="https://dvc.org/doc/command-reference/gc#removing-data-in-remote-storage" target="_blank" rel="nofollow noopener noreferrer">our docs</a>.
Make sure you're super careful when deleting data from the cloud because it's an
irreversible action. Here's an example of a deletion command that will clear out
everything in your cloud storage <em>except</em> what is referenced in your workspace:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc gc</span> <span class="token parameter variable">--workspace</span> <span class="token parameter variable">--cloud</span></span></code></pre></div>
<p>The <code>--workspace</code> option keeps only the files and directories referenced in the
workspace, and <code>--cloud</code> extends the cleanup from the local cache to remote
storage. By default, this command will use the default remote you have set. You
can specify a different remote storage with the <code>--remote</code> option like this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc gc</span> <span class="token parameter variable">--workspace</span> <span class="token parameter variable">--cloud</span> <span class="token parameter variable">--remote</span> name_of_remote</span></code></pre></div>
<h3 id="im-using-dvc-experiments-but-the-git-index-gets-corrupted-with-large-4gb-files-what-is-the-best-workaround" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/928939232033140736" target="_blank" rel="nofollow noopener noreferrer">I'm using DVC experiments, but the Git index gets corrupted with large (4GB) files. What is the best workaround?</a><a href="#im-using-dvc-experiments-but-the-git-index-gets-corrupted-with-large-4gb-files-what-is-the-best-workaround" aria-label="im using dvc experiments but the git index gets corrupted with large 4gb files what is the best workaround permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Great question from @charles.melby-thompson!</p>
<p>Experiment files may be tracked by Git or DVC. For large files, we generally
recommend tracking them with DVC, in which case file size shouldn't be an issue.</p>
<p>By default, experiments will track all other files with Git. However, Git can
fail when it is asked to handle too much data. If there are files you don't want to
track at all (such as large temporary/intermediate files), you can add them to your <code>.gitignore</code>
file.</p>
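<p>As a minimal sketch of the <code>.gitignore</code> approach, the snippet below appends two
ignore rules from the shell. The <code>tmp/</code> directory and the <code>*.partial</code> pattern are
hypothetical placeholders, not paths from the original question, so substitute
whatever large temporary files your own pipeline produces.</p>

```shell
# Hypothetical patterns: replace them with the large temporary or
# intermediate files that your own project generates.
printf '%s\n' 'tmp/' '*.partial' >> .gitignore

# Show the rules that are now in place.
cat .gitignore
```

<p>Large files that you do want to keep are better tracked with DVC (for example
with <code>dvc add</code>) than ignored.</p>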
<p>Check out
<a href="https://github.com/iterative/dvc/issues/6181" target="_blank" rel="nofollow noopener noreferrer">this open issue with experiments</a>
for more details and to provide feedback.</p>
<h3 id="is-there-an-easy-way-to-visualize-dvc-experiment-results-without-using-the-command-line" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/930150143259459644" target="_blank" rel="nofollow noopener noreferrer">Is there an easy way to visualize DVC experiment results without using the command line?</a><a href="#is-there-an-easy-way-to-visualize-dvc-experiment-results-without-using-the-command-line" aria-label="is there an easy way to visualize dvc experiment results without using the command line permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Good question @LucZ[Mad]!</p>
<p>If you bring those experiments into your regular Git workflow, e.g. using
<a href="https://dvc.org/doc/command-reference/exp/branch"><code>dvc exp branch</code></a> to create a branch for any experiment you want to share, you
could use <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">DVC Studio</a> to visualize them.</p>
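<p>A rough sketch of that flow, wrapped in a small shell function, might look like
the following. The function name <code>share_experiment</code>, the experiment name
<code>exp-e6c97</code>, the branch name, and the assumption that your Git remote is called
<code>origin</code> are all illustrative and would need adapting to your repository.</p>

```shell
# Hypothetical helper: promote a DVC experiment to a Git branch and push it,
# so that it becomes visible to anything that reads the Git remote.
share_experiment() {
  exp_name="$1"   # experiment name as listed by `dvc exp show`
  branch="$2"     # name of the Git branch to create

  # Only push if the branch was created successfully.
  dvc exp branch "$exp_name" "$branch" &&
    git push origin "$branch"   # assumes the Git remote is called `origin`
}

# Example call (placeholder names):
#   share_experiment exp-e6c97 my-experiment-branch
```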
<p>We're working on support for viewing any pushed experiments in Studio right now
so if there's anything you want to see, make sure to comment on and follow
<a href="https://github.com/iterative/studio-support/issues/45" target="_blank" rel="nofollow noopener noreferrer">this issue</a>.</p>
<h3 id="can-cml-self-hosted-runners-stop-the-instance-after-the-idle-timeout-instead-of-terminating" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/933674203796873226" target="_blank" rel="nofollow noopener noreferrer">Can CML self-hosted runners stop the instance after the idle timeout instead of terminating?</a><a href="#can-cml-self-hosted-runners-stop-the-instance-after-the-idle-timeout-instead-of-terminating" aria-label="can cml self hosted runners stop the instance after the idle timeout instead of terminating permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is another fantastic question from @jotsif!</p>
<p>No, we deliberately terminate the instance to avoid unexpected costs. Stopped
but unterminated instances
<a href="https://aws.amazon.com/premiumsupport/knowledge-center/ec2-billing-terminated/" target="_blank" rel="nofollow noopener noreferrer">can still incur charges</a>, for example for attached storage.
It's best to let the CML runner terminate and create new instances, running
<a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> to restore your data each time.</p>
<p>However, if you're trying to preserve data (e.g. cache dependencies to speed up
experimentation time) on an AWS EC2 instance, you could
<a href="https://aws.amazon.com/premiumsupport/knowledge-center/s3-transfer-data-bucket-instance/" target="_blank" rel="nofollow noopener noreferrer">connect persistent AWS S3 remote storage</a>.</p>
<h3 id="whats-the-difference-between-dvc-studio-free-and-enterprise-versions" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/841856466897469441/933324508570472497" target="_blank" rel="nofollow noopener noreferrer">What's the difference between DVC Studio free and enterprise versions?</a><a href="#whats-the-difference-between-dvc-studio-free-and-enterprise-versions" aria-label="whats the difference between dvc studio free and enterprise versions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for asking @Abdi!</p>
<p>You can find more info about the different
<a href="https://studio.datachain.ai/#pricing" target="_blank" rel="nofollow noopener noreferrer">DVC Studio tiers here</a>.</p>
<p>The <em>Free</em> tier has all the features most individual users need, like connecting
to ML repositories, creating views, submitting experiments, and generating
plots. The <em>Teams</em> tier allows you to create large teams for better
collaboration and sharing of views and settings with everyone. The <em>Enterprise</em>
tier is more for needs around compliance, dedicated support, and on-premise
installation.</p>
<p>If you are trying to decide which plan to select, please email us at
<code>[email protected]</code> and we'll help you figure it out based on your needs.</p>
<h3 id="how-can-i-use-one-dvcyaml-file-with-multiple-pipeline-folders-with-different-paramsyaml-files" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/939099847288578079" target="_blank" rel="nofollow noopener noreferrer">How can I use one <code>dvc.yaml</code> file with multiple pipeline folders with different <code>params.yaml</code> files?</a><a href="#how-can-i-use-one-dvcyaml-file-with-multiple-pipeline-folders-with-different-paramsyaml-files" aria-label="how can i use one dvcyaml file with multiple pipeline folders with different paramsyaml files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>@louisv, thanks for this question!</p>
<p>It seems like you're looking for the parametrization functionality. You can
learn more about how it works
<a href="https://dvc.org/doc/user-guide/project-structure/pipelines-files#templating" target="_blank" rel="nofollow noopener noreferrer">in this doc</a>,
but here's an example of what that might look like in the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">cleanups</span><span class="token punctuation">:</span>
    <span class="token key atrule">foreach</span><span class="token punctuation">:</span> <span class="token comment"># List of simple values</span>
      <span class="token punctuation">-</span> raw1
      <span class="token punctuation">-</span> labels1
      <span class="token punctuation">-</span> raw2
    <span class="token key atrule">do</span><span class="token punctuation">:</span>
      <span class="token key atrule">cmd</span><span class="token punctuation">:</span> clean.py "$<span class="token punctuation">{</span>item<span class="token punctuation">}</span>"
      <span class="token key atrule">outs</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> $<span class="token punctuation">{</span>item<span class="token punctuation">}</span>.cln</code></pre></div>
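<p>If each pipeline folder needs its own parameter values, the dictionary form of
<code>foreach</code> may be a closer fit. The following is only a sketch under
assumed names: the keys <code>pipeline_a</code> and <code>pipeline_b</code>, the script
<code>train.py</code>, and the <code>epochs</code> and <code>thresh</code> values are
hypothetical and not taken from the original question. Here <code>${key}</code> resolves
to the dictionary key and <code>${item.epochs}</code> to a value nested under it.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml">stages:
  train:
    foreach: # Dictionary keyed by (hypothetical) pipeline folder name
      pipeline_a:
        epochs: 3
        thresh: 10
      pipeline_b:
        epochs: 10
        thresh: 15
    do:
      # ${key} is the folder name; ${item.*} are the values listed under it
      cmd: python train.py --data ${key} --epochs ${item.epochs} --thresh ${item.thresh}
      outs:
        - models/${key}.pkl</code></pre></div>
<p>DVC names each generated stage <code>&lt;stage&gt;@&lt;key&gt;</code>, so one folder's stage
could be run on its own with <code>dvc repro train@pipeline_a</code>.</p>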
<h3 id="is-it-possible-to-change-the-x-label-in-dvc-studio" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/841856466897469441/938857004187943003" target="_blank" rel="nofollow noopener noreferrer">Is it possible to change the x-label in DVC Studio?</a><a href="#is-it-possible-to-change-the-x-label-in-dvc-studio" aria-label="is it possible to change the x label in dvc studio permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>A great question about Studio from @PythonF!</p>
<p>You can set custom properties for your plot in your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> like this:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">plots</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token key atrule">plots_no_cache.csv</span><span class="token punctuation">:</span>
      <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span>
      <span class="token key atrule">x</span><span class="token punctuation">:</span> r</code></pre></div>
<p>You can also use <a href="https://dvc.org/doc/command-reference/plots"><code>dvc plots modify</code></a> to change the x-label or y-label for your
plots using commands similar to the following.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots modify</span> plots_no_cache.csv <span class="token parameter variable">-x</span> r <span class="token parameter variable">-y</span> q</span></code></pre></div>
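<p>To change the axis label text itself rather than the field being plotted,
plot entries in <code>dvc.yaml</code> also accept <code>x_label</code> and
<code>y_label</code> properties. A minimal sketch, assuming the same
<code>plots_no_cache.csv</code> entry as above; the label strings are made up for
illustration:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml">plots:
  - plots_no_cache.csv:
      cache: false
      x: r
      x_label: My x-axis label # hypothetical label text
      y_label: My y-axis label # hypothetical label text</code></pre></div>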
<hr>
<p><img src="https://media.giphy.com/media/h5Ct5uxV5RfwY/giphy.gif" alt="Done Tyler The Creator GIF"></p>
<p>At our March Office Hours Meetup we will be talking about how you can create, run, and
benchmark DVC pipelines with <a href="https://github.com/zincware/ZnTrack" target="_blank" rel="nofollow noopener noreferrer">ZnTrack</a>!
<a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/283998696/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a>
to stay up to date with specifics as we get closer to the event!</p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and
CML questions answered!</p>
https://dvc.org/blog/february-22-heartbeat
Thu, 17 Feb 2022 00:00:00 GMT
<details>
<summary>✨Image Inspo✨</summary>
<p>This month's Heartbeat image is inspired by Community member Daniel Barnes.<br>
Daniel has been a great contributor to CML, helps folks with questions in
Discord, and frequently attends our Meetups. The image is inspired by his
<a href="https://app.orbit.love/dvc-community/members/danielbarnes" target="_blank" rel="nofollow noopener noreferrer">GitHub profile image</a>
and the fact that he used to be a competitive paraglider. His record is 9.5
hours in the air! 😳 Many thanks to Daniel for his contributions to the
Community that keep us all flying high! 🪂</p>
</details>
<h1 id="community-news" style="position:relative;">Community News<a href="#community-news" aria-label="community news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><img src="https://media.giphy.com/media/d3mn5mnDkwECLmnK/giphy.gif" alt="Stranger Things Math GIF by Wetpaint"></p>
<p>The year is already flying by! Check out what's new this month!</p>
<h2 id="fuzzylabs-open-source-mlops-is-awesome" style="position:relative;">FuzzyLabs Open Source MLOps is Awesome<a href="#fuzzylabs-open-source-mlops-is-awesome" aria-label="fuzzylabs open source mlops is awesome permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>So let me guess, still overwhelmed with MLOps tool choices? This past month
<a href="https://www.linkedin.com/in/matt-squire-a19896125/" target="_blank" rel="nofollow noopener noreferrer"><strong>Matt Squire</strong></a> of
<a href="http://FuzzyLabs.ai" target="_blank" rel="nofollow noopener noreferrer">Fuzzy Labs.ai</a> reviewed their
<a href="https://github.com/fuzzylabs/awesome-open-mlops">Awesome Open Source MLOps repo</a>
<a href="https://fuzzylabs.ai/blog/open-source-mlops-is-awesome/" target="_blank" rel="nofollow noopener noreferrer">in this blog</a> and
<a href="https://youtu.be/HIAPoKEDXrg" target="_blank" rel="nofollow noopener noreferrer">this video</a>. Matt breaks down the tool space into
categories of SaaS platforms, fully open source tools, and partly open source
tools. He describes how they define open source and why they think open source
is the best choice in the MLOps space: in their view it is
<em>flexible</em>, <em>ownable</em>, <em>cost-effective</em>, and <em>agile</em>.</p>
<blockquote>
<p>"Turn key solutions quickly become inflexible." - Matt Squire</p>
</blockquote>
<p>Fuzzy Labs, a small AI company in Manchester, England, had a need for
flexibility in their work with their clients, so they did a deep dive into MLOps
tooling and established an MLOps Platform meeting the open source and flexible
criteria they required. This stack includes our own <em>DVC</em>, as well as
<a href="https://github.com/IDSIA/sacred" target="_blank" rel="nofollow noopener noreferrer">Sacred</a>, <a href="https://zenml.io/" target="_blank" rel="nofollow noopener noreferrer">ZenML</a>,
<a href="https://www.seldon.io/tech/products/core" target="_blank" rel="nofollow noopener noreferrer">Seldon Core</a>, and
<a href="https://evidentlyai.com/" target="_blank" rel="nofollow noopener noreferrer">Evidently AI.</a></p>
<p>The blog and the video are definitely good material to review if you're choosing
your ML tools.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/HIAPoKEDXrg?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="continuous-machine-learning-on-huggingface-transformer-with-dvc-including-weights--biases-implementation-and-converting-weights-to-onnx" style="position:relative;">Continuous Machine Learning on Huggingface Transformer with DVC including Weights & Biases Implementation and Converting Weights to ONNX.<a href="#continuous-machine-learning-on-huggingface-transformer-with-dvc-including-weights--biases-implementation-and-converting-weights-to-onnx" aria-label="continuous machine learning on huggingface transformer with dvc including weights biases implementation and converting weights to onnx permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>As the title would suggest,
<a href="https://medium.com/@arjunkumbakkara/continuous-machine-learning-on-huggingface-transformer-with-dvc-including-weights-biases-1bc4520d210" target="_blank" rel="nofollow noopener noreferrer">this jam-packed article</a>
from <a href="https://github.com/nabarunbaruaAIML" target="_blank" rel="nofollow noopener noreferrer"><strong>Nabarun Barua</strong></a> and
<a href="https://github.com/arjunKumbakkara" target="_blank" rel="nofollow noopener noreferrer"><strong>Arjun Kumbakkara</strong></a> focuses on how CML
can be implemented in an NLP project. They assume knowledge of DVC,
Transformers, ONNX and Weights & Biases, so be ready to take your skills to the
next level by automating parts of the process with CML.</p>
<p>They begin with the all-important setup: an AWS IAM user with EC2 and S3 Developer
access, an S3 bucket to store the dataset, and a request for an EC2 spot instance.
They then continue into a detailed description of all the stages of the project,
outlining the use of all the tools including DVC Studio. You can find
<a href="https://github.com/nabarunbaruaAIML/CML_with_DVC_on_Transformer_NLP" target="_blank" rel="nofollow noopener noreferrer">the repo for the project here.</a>
Looking forward to the next installment from Nabarun and Arjun on a Dockerized
Container Application cluster with Kubernetes Orchestration. 🍿</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/5e13082e5c6a0afb25bd2a81396f76df/39600/arjun-kumbakkara-architecture.png" alt="Training, Deployment and Retraining Architecture" title="Training, Deployment and Retraining Architecture" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Total architecture with the Training, Deployment, and Retraining Pipelines in
the same order.
(<a href="https://medium.com/@arjunkumbakkara/continuous-machine-learning-on-huggingface-transformer-with-dvc-including-weights-biases-1bc4520d210" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="dvc-used-to-help-extract-knowledge-from-covid-19-research" style="position:relative;">DVC Used to help extract knowledge from COVID-19 research<a href="#dvc-used-to-help-extract-knowledge-from-covid-19-research" aria-label="dvc used to help extract knowledge from covid 19 research permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In case you missed it in our
<a href="https://twitter.com/ivanovitchm/status/1482742970461863939?s=20&t=QrfDTRHcZOKWIe5n5mb7ZQ" target="_blank" rel="nofollow noopener noreferrer">Twitter feed</a>,
a group of scientists
<a href="https://link.springer.com/article/10.1007/s11192-021-04260-y" target="_blank" rel="nofollow noopener noreferrer">published an article</a>
in <a href="https://link.springer.com/journal/11192" target="_blank" rel="nofollow noopener noreferrer">Scientometrics Journal</a> entitled,
<em>Discovering temporal scientometric knowledge in COVID-19 scholarly production</em>.
The authors,
<a href="https://link.springer.com/article/10.1007/s11192-021-04260-y#auth-Breno_Santana-Santos" target="_blank" rel="nofollow noopener noreferrer"><strong>Breno Santana Santos</strong></a>,
<a href="https://link.springer.com/article/10.1007/s11192-021-04260-y#auth-Ivanovitch-Silva" target="_blank" rel="nofollow noopener noreferrer"><strong>Ivanovitch Silva</strong></a>,
<a href="https://link.springer.com/article/10.1007/s11192-021-04260-y#auth-Luciana-Lima" target="_blank" rel="nofollow noopener noreferrer"><strong>Luciana Lima</strong></a>,
<a href="https://link.springer.com/article/10.1007/s11192-021-04260-y#auth-Patricia_Takako-Endo" target="_blank" rel="nofollow noopener noreferrer"><strong>Patricia Takako Endo</strong></a>,
<a href="https://link.springer.com/article/10.1007/s11192-021-04260-y#auth-Gisliany-Alves" target="_blank" rel="nofollow noopener noreferrer"><strong>Gisliany Alves</strong></a>,
and
<a href="https://link.springer.com/article/10.1007/s11192-021-04260-y#auth-Marcel_da_C_mara-Ribeiro_Dantas" target="_blank" rel="nofollow noopener noreferrer"><strong>Marcel da Câmara Ribeiro-Dantas</strong></a>,
used DVC to create a reproducible workflow that combined machine learning and
Complex Network Analysis techniques to extract implicit and temporal knowledge
from scientific production bases on COVID-19.</p>
<blockquote>
<p>"The presented methodology has the potential to instrument and expand
strategic and proactive decisions of the scientific community aiming at
knowledge extraction that supports the fight against the pandemic."</p>
</blockquote>
<p>We are so happy to be helpful in the fight against the pandemic! Be sure to
check out the paper and keep an eye out for a future Meetup where they
present this work!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/70096f540bd1c2c3c5cbf29aca5b187b/39600/scientometric.png" alt="DVC in Scientometric Covid Research" title="DVC in Scientometric Covid Research" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Discovering temporal scientometric knowledge in COVID-19 scholarly production
(<a href="https://link.springer.com/article/10.1007/s11192-021-04260-y" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h1 id="github-goodness-and-integrations" style="position:relative;">GitHub Goodness and Integrations<a href="#github-goodness-and-integrations" aria-label="github goodness and integrations permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<ul>
<li>
<p>If you're a <a href="https://guild.ai/" target="_blank" rel="nofollow noopener noreferrer"><strong>Guild.Ai</strong></a> user, you'll be happy to know
that Guild now supports DVC! Find out more in
<a href="https://my.guild.ai/t/using-guild-ai-with-dvc/803" target="_blank" rel="nofollow noopener noreferrer">this article</a> by
<a href="https://www.linkedin.com/in/gar1t/" target="_blank" rel="nofollow noopener noreferrer"><strong>Garrett Smith</strong></a> and the
<a href="https://github.com/guildai/guildai/tree/dvc/examples/dvc" target="_blank" rel="nofollow noopener noreferrer">corresponding repo</a>
for an example.</p>
</li>
<li>
<p><a href="https://github.com/lucmos" target="_blank" rel="nofollow noopener noreferrer"><strong>Luca Moschella</strong></a> created
<a href="https://github.com/grok-ai/nn-template" target="_blank" rel="nofollow noopener noreferrer">this <strong>NN template</strong></a> for your neural
network projects where you want to combine PyTorch Lightning, Hydra, DVC,
Weights and Biases and Streamlit.</p>
</li>
<li>
<p>Just a reminder for your NLP projects, <a href="https://spacy.io/" target="_blank" rel="nofollow noopener noreferrer"><strong>spaCy</strong></a>
integrates with DVC as well. You can find out more info on
<a href="https://spacy.io/usage/projects#integrations" target="_blank" rel="nofollow noopener noreferrer">the integration here.</a></p>
</li>
</ul>
<p><img src="https://media.giphy.com/media/13zeE9qQNC5IKk/giphy.gif" alt="Seal Of Approval Thumbs Up GIF"></p>
<h1 id="in-other-data-science-and-ai-news" style="position:relative;">In Other Data Science and AI News<a href="#in-other-data-science-and-ai-news" aria-label="in other data science and ai news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="10-most-important-jobs-for-ml-products-in-2022" style="position:relative;">10 Most Important Jobs for ML Products in 2022<a href="#10-most-important-jobs-for-ml-products-in-2022" aria-label="10 most important jobs for ml products in 2022 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f9c2c020df1b6701551c734955fb0837/39600/roles-in-ai.png" alt="10 Most Important Jobs for ML Products in 2022" title="10 Most Important Jobs for ML Products in 2022" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
People new to the data science/ML space are often overwhelmed by all that there
is to learn and by the challenge of choosing a path. When I get this question
from Community members, I always have the same advice: try to figure out what
part of DS/AI is most interesting to you and then work on building your skills
toward that. In this article on the
<a href="https://medium.datadriveninvestor.com/the-10-most-important-jobs-for-ml-products-in-2022-7bf844d62423" target="_blank" rel="nofollow noopener noreferrer">10 Most Important Jobs for ML Products in 2022</a>,
<a href="https://www.linkedin.com/in/agoston-torok/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ágoston Török</strong></a> does a great job
of defining the different roles in the space, how they interrelate, and how they
show up in AI companies in the product development process. See his breakdown of
the roles above, with rows defining the stage, and columns, the aspects the
roles focus on. If you find you are drawn to the space where the DS prototypes
become the software product, then you may want to check out
<a href="https://learn.dvc.org" target="_blank" rel="nofollow noopener noreferrer">our new course!</a> 😉</p>
<h2 id="engineering-best-practices-for-machine-learning" style="position:relative;">Engineering Best Practices for Machine Learning<a href="#engineering-best-practices-for-machine-learning" aria-label="engineering best practices for machine learning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Diving deeper into these roles, the team was abuzz recently, reviewing
<a href="https://se.ewi.tudelft.nl/remla/slides/07_ASerban_mleng_practices.pdf" target="_blank" rel="nofollow noopener noreferrer">this slide deck</a>
on <em>Engineering Best Practices for Machine Learning</em> by
<a href="https://www.linkedin.com/in/serbanac/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Serban</strong></a>. In it Alex discusses
the challenges of creating software from machine learning projects, the
differences between these projects and traditional software development, and the
need for developing robust and ethical practices. He and his colleagues,
<a href="https://liacs.leidenuniv.nl/~blomkvander/" target="_blank" rel="nofollow noopener noreferrer"><strong>Koen van der Blom</strong></a>,
<a href="https://ada.liacs.nl/members/" target="_blank" rel="nofollow noopener noreferrer"><strong>Holger Hoos</strong></a>, and
<a href="https://jstvssr.github.io/" target="_blank" rel="nofollow noopener noreferrer"><strong>Joost Visser</strong></a> created a survey to determine
current adoption of best practices in the industry. Along with the great review
of the survey results in the slides, a number of resources were provided
including
<a href="https://github.com/SE-ML/awesome-seml/blob/master/readme.md" target="_blank" rel="nofollow noopener noreferrer">the corresponding Awesome list, </a>
a
<a href="https://se-ml.github.io/practices/" target="_blank" rel="nofollow noopener noreferrer">Catalog of Best ML Engineering Practices</a>,
and their <a href="https://se-ml.github.io/" target="_blank" rel="nofollow noopener noreferrer">project website</a> for more information on
the whole project. Definitely worth your review! ✅</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d3571466bc128ab349dca2ab39d07161/39600/alex-serban.png" alt="Engineering Best Practices for Machine Learning" title="Engineering Best Practices for Machine Learning" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>29 Machine Learning Engineering practices ranked by adoption
(<a href="https://se.ewi.tudelft.nl/remla/slides/07_ASerban_mleng_practices.pdf" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="twine-ethical-datasets" style="position:relative;">Twine Ethical Datasets<a href="#twine-ethical-datasets" aria-label="twine ethical datasets permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Are you in need of ethically sourced audio or video data for your ML project?
<a href="https://www.twine.net/ai" target="_blank" rel="nofollow noopener noreferrer">Twine</a> has created a way to accomplish this, while
simultaneously freeing ML teams of the project management lift associated with
the collection of these datasets.<br>
You can learn more about Twine's efforts in ethical data collection through
these articles,
<a href="https://www.twine.net/blog/the-importance-of-ethically-sourced-data/" target="_blank" rel="nofollow noopener noreferrer">The Importance of Ethically Sourced Data,</a>
<a href="https://www.twine.net/blog/bias-in-data-collection/" target="_blank" rel="nofollow noopener noreferrer">Bias in Data Collection, </a>
<a href="https://www.twine.net/blog/diversity-data-inclusive-workforce/" target="_blank" rel="nofollow noopener noreferrer">Collecting Diversity Data: How to Ensure an Inclusive Workforce,</a>
and
<a href="https://www.twine.net/blog/the-hidden-costs-of-bad-data/" target="_blank" rel="nofollow noopener noreferrer">The Hidden Costs of Bad Data.</a>
Twine also provides
<a href="https://www.twine.net/blog/100-audio-and-video-datasets/" target="_blank" rel="nofollow noopener noreferrer">100 open audio and video datasets</a>
for anyone working on these types of projects. Check it out! 👇🏽</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.twine.net/blog/100-audio-and-video-datasets/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Twine Ethically Sourced Datasets</h4>
<div class="elp-description">100 Ethically sourced audio and video datasets from Twine.</div>
<div class="elp-link">https://twine.net/</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2022-02-17/twine-b098886b6287a5d276c534bb8c2de293.png" alt="Twine Ethically Sourced Datasets">
</div>
</a>
</section>
<p></p>
<h2 id="batterydev-hackathon-2022" style="position:relative;">BatteryDEV Hackathon 2022<a href="#batterydev-hackathon-2022" aria-label="batterydev hackathon 2022 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Are you interested in battery technology and in participating in a Hackathon
using battery data? The
<a href="https://www.tfir.io/how-experiment-versioning-is-going-to-solve-big-problems-of-ai-ml-world/" target="_blank" rel="nofollow noopener noreferrer">growth of battery technology</a>
is accelerating as the world looks to solve some of its
emissions issues with electric vehicles. Additionally, the demand for electric
vehicles
<a href="https://www.mckinsey.com/business-functions/operations/our-insights/unlocking-growth-in-battery-cell-manufacturing-for-electric-vehicles" target="_blank" rel="nofollow noopener noreferrer">is outpacing</a>
the manufacturers' ability to supply the needed batteries. Datasets in the space
are kept proprietary as companies work independently to develop patents.
BatteryDEV 2022 aims to accelerate battery innovation through open source
competitions. This year they are expecting 300 participants for the event from
March 20-26. Community member
<a href="https://www.linkedin.com/in/raymond-james-gasper/" target="_blank" rel="nofollow noopener noreferrer">Raymond Gasper</a> is one of
the organizers of <a href="https://battery.dev" target="_blank" rel="nofollow noopener noreferrer">Battery.dev</a>, and is creating a DVC
template for participants to use during the Hackathon. You can
<a href="https://www.battery.dev/registration-form" target="_blank" rel="nofollow noopener noreferrer">register for the event here!</a></p>
<p>
</p><section class="elp-content-holder">
<a href="https://battery.dev" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">BatteryDEV 2022 Hackathon</h4>
<div class="elp-description">A global innovation challenge for battery, data and machine learning enthusiasts.</div>
<div class="elp-link">https://battery.dev/</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2022-02-17/battery-dev-f0777718a6d186b28446066b3f901cc4.png" alt="BatteryDEV 2022 Hackathon">
</div>
</a>
</section>
<p></p>
<h1 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a> talked to
<a href="https://twitter.com/SwapBhartiya" target="_blank" rel="nofollow noopener noreferrer"><strong>Swapnil Bhartiya</strong></a> recently about how
experiment versioning can help to solve the big problems of the AI/ML world. In
this interview you will learn how experiment versioning tracks everything you
need for a particular experiment so that the result is reproducible from
prototyping to production. This solution enables data science and engineering
teams to work more productively together.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/y5zp54LiAqg?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="upcoming-events" style="position:relative;">Upcoming Events<a href="#upcoming-events" aria-label="upcoming events permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="march-office-hours" style="position:relative;">March Office Hours!<a href="#march-office-hours" aria-label="march office hours permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Be sure to join us at the
<a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/283998696/" target="_blank" rel="nofollow noopener noreferrer">March Office Hours Meetup,</a>
where <a href="https://github.com/PythonFZ/" target="_blank" rel="nofollow noopener noreferrer"><strong>Fabian Zills</strong>,</a> PhD student at
<a href="https://www.uni-stuttgart.de/en/" target="_blank" rel="nofollow noopener noreferrer">University of Stuttgart,</a> will present his
ZnTrack project which creates, runs and benchmarks DVC pipelines in Python and
Jupyter Notebooks.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/283998696/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">March Office Hours - ZnTrack</h4>
<div class="elp-description">RSVP for DVC Office Hours - ZnTrack - Create, Visualize, Run and Benchmark DVC Pipelines in Python & Jupyter Notebooks </div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2022-02-17/office-hours-meetup-39d6c71b2928c57d1858c4544400dffc.png" alt="March Office Hours - ZnTrack">
</div>
</a>
</section>
<p></p>
<h2 id="new-hires" style="position:relative;">New Hires<a href="#new-hires" aria-label="new hires permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We are extremely excited to welcome our new Director of Engineering,
<a href="https://www.linkedin.com/in/odedmesser/" target="_blank" rel="nofollow noopener noreferrer"><strong>Oded Messer</strong></a>. Oded lives in Israel
and plans to pour his time and attention into the people/processes/structures of
the engineering org to facilitate healthy growth and culture.💗 He brings
hands-on and managerial industry experience in the backend/tooling/infra and
MLOps domains (ex. Intel and Iguazio). In his spare time Oded remembers
traveling being a favorite activity, and also admits to being a sci-fi geek.
He's in good company here! 😉</p>
<p>We welcome <a href="https://twitter.com/alex000kim" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim</strong></a> who joins us as a
Field Data Scientist from Montreal, Canada. Alex's previous professional
experience has been at the intersection of Software Engineering and Data Science
across a few different industries. He has also done consulting work to develop
Data Science curriculums for EdTech companies. Alex speaks Russian and a little
French in addition to English. In his free time, Alex likes to bake, his
specialty being pizza! 🍕</p>
<details>
<summary>🎉Fun Fact!</summary>
<p>We now have three Alexes on the team to match our three Davids!</p>
</details>
<p><a href="https://github.com/jesper7" target="_blank" rel="nofollow noopener noreferrer"><strong>Jesper Svendsen</strong></a> joins the team as a Platform
Engineer from Denmark.<br>
Previously, Jesper worked as an SRE for Evaxion Biotech (another ML-driven
company). Prior to that, he was a self-employed IT consultant, where he did
full-stack development. Jesper's hobbies include reading books (particularly on
medicine and psychology), weightlifting, running, and photography. 📸</p>
<details>
<summary>🎉Fun Fact!</summary>
<p>Jesper is the eighth employee to join <a href="https://iterative.ai" target="_blank" rel="nofollow noopener noreferrer">Iterative.AI</a>
with a name starting with the letter 'J'. I thought this was odd, as words that
start with 'j' have one of the
<a href="https://funbutlearn.com/2012/06/which-english-letter-has-maximum-words.html" target="_blank" rel="nofollow noopener noreferrer">lowest frequencies in the English language</a>.
But as it turns out, 'J' is
<a href="https://www.quora.com/What-letter-of-the-English-alphabet-are-used-most-as-the-first-letter-of-the-first-name" target="_blank" rel="nofollow noopener noreferrer">one of the more common first initials.</a></p>
</details>
<p><a href="https://github.com/erudin" target="_blank" rel="nofollow noopener noreferrer"><strong>Gabriella Caraballo</strong></a> joins Iterative as a
Backend Engineer. She is originally from Venezuela, but is currently living in
Canada! Programming was a hobby that became a professional path for Gabriella.
She loves everything related to security, privacy and open source. In her free
time, Gabriella enjoys cooking and eating, playing video/board games,
crocheting, photography, and music. Now that she's in Canada, she has added
skiing to her hobbies! ⛷</p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Even with these amazing new additions to the team, we're still hiring!
<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a>
to find details of all the positions and share with anyone you think may be
interested! 🚀</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b575ad2dddeaef4c8a1e475f80cc5ca2/03346/hiring.jpg" alt="Iterative.ai is Hiring" title="Iterative.ai is Hiring" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Iterative is Hiring
(<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">With tools like <a href="https://twitter.com/DVCorg">@DVCorg</a> & <a href="https://twitter.com/TheRealDagsHub">@TheRealDAGsHub</a> you can easily share , review & reproduce/reuse your work. <br><br>Just like how git makes software development smooth for software developers that's how tools like DVC make reproducibility smooth for ML Engineers.</p>— Gift Ojeabulu (@GiftOjeabulu_) <a href="https://twitter.com/GiftOjeabulu_/status/1490771330949599234">February 7, 2022</a></blockquote>
<hr>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/january-22-community-gemshttps://dvc.org/blog/january-22-community-gemsMon, 31 Jan 2022 00:00:00 GMT<h3 id="is-it-possible-to-stream-objects-to-and-from-remote-caches" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/919567459189682177" target="_blank" rel="nofollow noopener noreferrer">Is it possible to stream objects to and from remote caches?</a><a href="#is-it-possible-to-stream-objects-to-and-from-remote-caches" aria-label="is it possible to stream objects to and from remote caches permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for asking @mihaj!</p>
<p>You can stream files using the <a href="https://dvc.org/doc/api-reference" target="_blank" rel="nofollow noopener noreferrer">DVC API</a>.
There are two methods that you'll likely want to check out. First there's
<a href="https://dvc.org/doc/api-reference/open"><code>dvc.api.open()</code></a>. This opens a file tracked by DVC and generates a corresponding
file object. Here's a quick example:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> dvc<span class="token punctuation">.</span>api

<span class="token keyword">with</span> dvc<span class="token punctuation">.</span>api<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span>
    <span class="token string">'get-started/data.xml'</span><span class="token punctuation">,</span>
    repo<span class="token operator">=</span><span class="token string">'https://github.com/iterative/dataset-registry'</span>
<span class="token punctuation">)</span> <span class="token keyword">as</span> fd<span class="token punctuation">:</span>
    <span class="token comment"># do things with the file object here, e.g. read the first line</span>
    first_line <span class="token operator">=</span> fd<span class="token punctuation">.</span>readline<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div>
<p>The simplest way to return the contents of a DVC-tracked file is to use
<a href="https://dvc.org/doc/api-reference/read"><code>dvc.api.read()</code></a>. The returned content can be a bytes object or a string,
depending on the <code>mode</code> you pass. Here's a little example of this being used:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> pickle
<span class="token keyword">import</span> dvc<span class="token punctuation">.</span>api

<span class="token comment"># mode='rb' returns the raw bytes that pickle.loads() expects</span>
model <span class="token operator">=</span> pickle<span class="token punctuation">.</span>loads<span class="token punctuation">(</span>
    dvc<span class="token punctuation">.</span>api<span class="token punctuation">.</span>read<span class="token punctuation">(</span>
        <span class="token string">'model.pkl'</span><span class="token punctuation">,</span>
        repo<span class="token operator">=</span><span class="token string">'https://github.com/iterative/example-get-started'</span><span class="token punctuation">,</span>
        mode<span class="token operator">=</span><span class="token string">'rb'</span>
    <span class="token punctuation">)</span>
<span class="token punctuation">)</span></code></pre></div>
<h3 id="one-of-the-steps-in-my-dvc-pipeline-uses-a-pip-installed-package-what-is-the-best-way-to-make-sure-that-dvc-re-runs-the-steps-that-depend-on-that-package" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/920139825284280381" target="_blank" rel="nofollow noopener noreferrer">One of the steps in my DVC pipeline uses a <code>pip</code> installed package. What is the best way to make sure that DVC re-runs the steps that depend on that package?</a><a href="#one-of-the-steps-in-my-dvc-pipeline-uses-a-pip-installed-package-what-is-the-best-way-to-make-sure-that-dvc-re-runs-the-steps-that-depend-on-that-package" aria-label="one of the steps in my dvc pipeline uses a pip installed package what is the best way to make sure that dvc re runs the steps that depend on that package permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for the question @alphaomega!</p>
<p>The best way to handle any package dependencies is to include a
<code>requirements.txt</code> file with the specific versions your pipeline needs.</p>
<p>Another approach you can take is having a stage that dumps the package version
as an intermediate output. It doesn't have to be saved in Git or DVC because
it's easily reproduced and DVC should be able to take care of detecting that the
package didn't change. Here's an example of a stage that does this.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">package_version</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> pip freeze <span class="token punctuation">|</span> grep "package_name==" <span class="token punctuation">></span> package_name_version.txt
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> package_name_version.txt
  <span class="token key atrule">train</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> package_name_version.txt</code></pre></div>
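<p>If you'd rather not rely on <code>pip freeze</code> and <code>grep</code> being available in the
shell, the same idea can be sketched in Python with the standard library's
<code>importlib.metadata</code>. This is only an illustration: the function name
<code>dump_package_version</code> and the <code>not-installed</code> marker are assumptions made
for this sketch, not part of DVC.</p>

```python
from importlib import metadata
from pathlib import Path


def dump_package_version(package_name, out_path):
    """Write 'name==version' for a package to out_path and return that line."""
    try:
        version = metadata.version(package_name)
    except metadata.PackageNotFoundError:
        # Hypothetical marker: record the absence explicitly, so the file
        # content changes as soon as the package gets installed.
        version = "not-installed"
    line = f"{package_name}=={version}\n"
    Path(out_path).write_text(line)
    return line
```

<p>A small script that calls
<code>dump_package_version("package_name", "package_name_version.txt")</code> could then
serve as the <code>cmd</code> of the <code>package_version</code> stage, keeping the output file
name unchanged.</p>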
<h3 id="does-dvc-save-dependencies-which-are-in-the-dvcyaml-pipeline-to-the-cache" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/920659549835370497" target="_blank" rel="nofollow noopener noreferrer">Does DVC save dependencies which are in the <code>dvc.yaml</code> pipeline to the cache?</a><a href="#does-dvc-save-dependencies-which-are-in-the-dvcyaml-pipeline-to-the-cache" aria-label="does dvc save dependencies which are in the dvcyaml pipeline to the cache permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for another great question @rie!</p>
<p>DVC doesn't track the pipeline dependencies in the cache or storage, only the
outputs. If you want DVC to track a pure data dependency that's not an output of
a different stage, you need to track it with <a href="https://dvc.org/doc/command-reference/add"><code>dvc add ...</code></a>.</p>
<p>The output of a pipeline might be something like <code>data.dvc</code>, while a pure
dependency might be a file that's just a part of the project, like <code>script.py</code>.
That's why you'll need to use the <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> command to track this.</p>
<h3 id="what-is-the-difference-between-kubeflow-pipelines-and-dvc-pipelines" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/922728960478035978" target="_blank" rel="nofollow noopener noreferrer">What is the difference between Kubeflow pipelines and DVC pipelines?</a><a href="#what-is-the-difference-between-kubeflow-pipelines-and-dvc-pipelines" aria-label="what is the difference between kubeflow pipelines and dvc pipelines permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is a fantastic question! Thanks for asking @ramakrishnamamidi!</p>
<p>A major difference is that DVC focuses primarily on ML <em>development</em> and adding
lightweight functionality on top of existing projects, which may be reusable in
deployment in some cases.</p>
<p>Kubeflow focuses on <em>deployment</em> and building on top of Kubernetes, which could
be used during development but requires more up-front effort.</p>
<h3 id="could-dvc-be-a-good-alternative-to-lfs-for-game-development" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485586884165107734/928336349487067196" target="_blank" rel="nofollow noopener noreferrer">Could DVC be a good alternative to LFS for game development?</a><a href="#could-dvc-be-a-good-alternative-to-lfs-for-game-development" aria-label="could dvc be a good alternative to lfs for game development permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for such an interesting question @CB!</p>
<p>Yes! We have community members that use DVC to handle their large files in game
development.</p>
<p>There are several other use cases we've seen for DVC outside of machine learning
and data science. Some people have used DVC to track build artifacts for
deployment systems and to track performance data alongside design iterations and
simulation tools.</p>
<p>You should check out our
<a href="https://discord.com/channels/485586884165107732/918159153824952320" target="_blank" rel="nofollow noopener noreferrer">#beyond-ml</a>
Discord channel to stay up to date with the other use cases the community is
coming up with!</p>
<h3 id="does-dvc-run-on-jsonyaml-configuration-files-for-all-things" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/928779586622332938" target="_blank" rel="nofollow noopener noreferrer">Does DVC run on JSON/YAML configuration files for all things?</a><a href="#does-dvc-run-on-jsonyaml-configuration-files-for-all-things" aria-label="does dvc run on jsonyaml configuration files for all things permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is a great question about large projects with a lot of dependencies from
@SolemnSimulacrum!</p>
<p>All of the dependencies you list in <code>dvc run</code> are in fact configured in the
<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. <code>dvc run</code> is a convenience for adding a pipeline stage to this
file and then doing <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> on that stage. It's completely acceptable and
even encouraged to directly edit <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> if that's easier.</p>
<p>For example, if you are currently executing a command like this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-n</span> prune <span class="token punctuation">\</span>
<span class="token parameter variable">-o</span> model.pt <span class="token punctuation">\</span>
<span class="token parameter variable">-d</span> ./DepFiles_0/ <span class="token punctuation">\</span>
<span class="token parameter variable">-d</span> ./DepFiles_1/ <span class="token punctuation">\</span>
<span class="token parameter variable">-d</span> ./DepFiles_2/ <span class="token punctuation">\</span>
<span class="token parameter variable">-d</span> ./src/.py <span class="token punctuation">\</span>
<span class="token parameter variable">-d</span> ./packages/.py <span class="token punctuation">\</span>
<span class="token parameter variable">-d</span> ./scripts/.py <span class="token punctuation">\</span>
<span class="token parameter variable">-d</span> ./data/.npy <span class="token punctuation">\</span>
python script.py</span></code></pre></div>
<p>You could add those directly to the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> like this:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">prune</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python script.py
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> ./DepFiles_0/
      <span class="token punctuation">-</span> ./DepFiles_1/
      <span class="token punctuation">-</span> ./DepFiles_2/
      <span class="token punctuation">-</span> ./src/.py
      <span class="token punctuation">-</span> ./packages/.py
      <span class="token punctuation">-</span> ./scripts/.py
      <span class="token punctuation">-</span> ./data/.npy
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> model.pt</code></pre></div>
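<p>To make the mapping between the <code>dvc run</code> flags and the file explicit
(<code>-n</code> becomes the stage name, <code>-d</code> entries go under <code>deps</code>, <code>-o</code> entries
go under <code>outs</code>), here is a minimal Python sketch that renders a stage as
<code>dvc.yaml</code> text. The helper <code>stage_to_yaml</code> is hypothetical and not part of
DVC, and it assumes simple paths and commands that need no YAML quoting.</p>

```python
def stage_to_yaml(name, cmd, deps, outs):
    """Render one pipeline stage as dvc.yaml text (cmd, deps and outs only)."""
    lines = ["stages:", f"  {name}:", f"    cmd: {cmd}"]
    if deps:
        lines.append("    deps:")
        lines.extend(f"      - {dep}" for dep in deps)
    if outs:
        lines.append("    outs:")
        lines.extend(f"      - {out}" for out in outs)
    return "\n".join(lines) + "\n"


# Shortened version of the "prune" stage from the example above.
print(stage_to_yaml(
    name="prune",
    cmd="python script.py",
    deps=["./DepFiles_0/", "./DepFiles_1/", "./DepFiles_2/"],
    outs=["model.pt"],
))
```

<p>For a handful of stages, editing the file by hand is usually simpler. A helper
like this only starts to pay off when the list of dependencies is generated by
another script.</p>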
<h3 id="im-setting-up-mlops-at-my-company-from-scratch-and-we-use-gitlab-and-cloudera-ds-workbench-what-are-the-best-resources-to-get-started-with-cml" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/923785806848614461" target="_blank" rel="nofollow noopener noreferrer">I'm setting up MLOps at my company from scratch and we use GitLab and Cloudera DS workbench. What are the best resources to get started with CML?</a><a href="#im-setting-up-mlops-at-my-company-from-scratch-and-we-use-gitlab-and-cloudera-ds-workbench-what-are-the-best-resources-to-get-started-with-cml" aria-label="im setting up mlops at my company from scratch and we use gitlab and cloudera ds workbench what are the best resources to get started with cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is a great question from @dvc!</p>
<p>We recommend you start with the <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML docs website</a>.</p>
<p>You can find some tutorials on <a href="https://dvc.org/blog" target="_blank" rel="nofollow noopener noreferrer">our blog</a>.</p>
<p>Or you can check out the videos on
<a href="https://www.youtube.com/channel/UC37rp97Go-xIX3aNFVHhXfQ" target="_blank" rel="nofollow noopener noreferrer">our YouTube channel</a>.</p>
<p>And of course, you can always ask questions in the Discord community!</p>
<h3 id="i-understand-that-dvc-studio-is-a-discoverability-layer-over-my-dvc-repo-in-github-will-any-of-my-data-be-stored-on-your-servers" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/841856466897469441/923714473603256420" target="_blank" rel="nofollow noopener noreferrer">I understand that DVC Studio is a discoverability layer over my DVC repo in GitHub. Will any of my data be stored on your servers?</a><a href="#i-understand-that-dvc-studio-is-a-discoverability-layer-over-my-dvc-repo-in-github-will-any-of-my-data-be-stored-on-your-servers" aria-label="i understand that dvc studio is a discoverability layer over my dvc repo in github will any of my data be stored on your servers permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is a great question about DVC Studio from @johnnyaug!</p>
<p>DVC Studio only stores metrics, plots, and metadata about your pipelines in the
databases to be able to serve this as a table. We don't read actual data and we
don't store code.</p>
<p>An important thing to note is that if you have plots from <a href="https://dvc.org/doc/command-reference/plots/show"><code>dvc plots show</code></a> that
are images, JSON files, or vega specs, those could be saved on our end as well
to serve them in the UI.</p>
<p>We're working on documentation for this as well!</p>
<hr>
<p><img src="https://media.giphy.com/media/zCME2Cd20Czvy/giphy.gif" alt="The Lord Of The Rings GIF"></p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and
CML questions answered!</p>https://dvc.org/blog/January-22-heartbeathttps://dvc.org/blog/January-22-heartbeatTue, 18 Jan 2022 00:00:00 GMT<h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Happy New Year! Hope you got some good rest and stayed healthy at the end of
2021, because 2022 has lots of great things in store!</p>
<p><img src="https://media.giphy.com/media/7ILAGpJWoQYWA0j60C/giphy.gif" alt="Heartbeat!"></p>
<h2 id="diego-jardim---mlops-a-complete-hands-on-introduction" style="position:relative;">Diego Jardim - MLOps: A Complete Hands-On Introduction<a href="#diego-jardim---mlops-a-complete-hands-on-introduction" aria-label="diego jardim mlops a complete hands on introduction permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://poatek.com/2021/12/20/mlops-a-complete-and-hands-on-introduction-part-1/" target="_blank" rel="nofollow noopener noreferrer">In Part 1</a>
of his two-part series,
<a href="https://www.linkedin.com/in/diegosevero/" target="_blank" rel="nofollow noopener noreferrer"><strong>Diego Jardim</strong></a> of
<a href="https://poatek.com/" target="_blank" rel="nofollow noopener noreferrer">Poatek</a> takes us through the basics of MLOps and the
stages of implementation and maturity of an MLOps pipeline. He closes by
introducing some tools, including DVC and CML, that help a team progress
through these stages.</p>
<p><a href="https://poatek.com/2021/12/29/mlops-a-complete-and-hands-on-introduction-part-2/" target="_blank" rel="nofollow noopener noreferrer">In Part 2</a>
he delves into more detail and code on how to set up version control of
everything with DVC as well as automation of experimentation and reporting with
CML. Finally, he uses FastAPI and Heroku for model serving and deployment. You
can find all the scripts for the project in
<a href="https://github.com/dsjardim/fraud-detection-mlops" target="_blank" rel="nofollow noopener noreferrer">this GitHub repository.</a></p>
<p>
</p><section class="elp-content-holder">
<a href="https://poatek.com/2021/12/29/mlops-a-complete-and-hands-on-introduction-part-2/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">MLOps: A Complete Hands-On Tutorial</h4>
<div class="elp-description">In his 2-part series, Diego Jardim of Poatek introduces concepts and stages of MLOps and provides a tutorial on how to create an MLOps pipeline.</div>
<div class="elp-link">https://poatek.com/</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2022-01-18/diego-jardim-11d746cd4b5f1f2f63503c8ce444a180.png" alt="MLOps: A Complete Hands-On Tutorial">
</div>
</a>
</section>
<p></p>
<h2 id="carl-w-handlin-wallace---reproducible-data-science-and-why-it-matters" style="position:relative;">Carl W. Handlin Wallace - Reproducible Data Science and Why it Matters<a href="#carl-w-handlin-wallace---reproducible-data-science-and-why-it-matters" aria-label="carl w handlin wallace reproducible data science and why it matters permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/carlhandlin/" target="_blank" rel="nofollow noopener noreferrer"><strong>Carl W. Handlin Wallace</strong></a> of
<a href="https://www.rappibank.pe/" target="_blank" rel="nofollow noopener noreferrer">RappiBank</a> wrote a
<a href="https://medium.com/rappibank/reproducible-data-science-and-why-it-matters-e4e62fd60b9a/" target="_blank" rel="nofollow noopener noreferrer">great article</a>
for their company <a href="https://medium.com/" target="_blank" rel="nofollow noopener noreferrer">Medium</a> profile on the importance of
reproducibility, AKA replicability, in science in general, and the challenges in
Data Science in particular. As he points out, from
<a href="https://doi.org/10.1038/533452a" target="_blank" rel="nofollow noopener noreferrer">Nature's survey,</a> over half of all researchers
have failed to reproduce even their own work, let alone that of another
scientist. While initiatives like
<a href="https://paperswithcode.com/" target="_blank" rel="nofollow noopener noreferrer">Papers With Code</a> are helping to encourage
reproducibility in the industry, there's still work to be done. He notes DVC as
a part of the solution to this problem along with other tools to round out the
whole picture. Check out the article for good food for thought and other
resources!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c52351436c2bfd17633dbb36d3dfd200/39600/carl-handlin-rappibank.png" alt="Proposed Reproducibility Framework for Data Science" title="Proposed Reproducibility Framework for Data Science" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p><em>Carl W. Handlin Wallace's Proposed Reproducibility Framework for Data Science
(<a href="https://medium.com/rappibank/reproducible-data-science-and-why-it-matters-e4e62fd60b9a/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="abid-ali-awan---tips--tricks-of-deploying-deep-learning-webapp-on-heroku-cloud" style="position:relative;">Abid Ali Awan - Tips & Tricks of Deploying Deep Learning Webapp on Heroku Cloud<a href="#abid-ali-awan---tips--tricks-of-deploying-deep-learning-webapp-on-heroku-cloud" aria-label="abid ali awan tips tricks of deploying deep learning webapp on heroku cloud permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><span class="gatsby-resp-image-wrapper image-wrap-right" style="position: relative; display: block; ; ; max-width: 450px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2a68cbe567a7492f57b697eb6bbf9273/39600/abid-ali-awan.png" alt="DVC Heroku Integration" title="Heroku Hidden Tricks" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p><a href="https://www.linkedin.com/in/1abidaliawan/" target="_blank" rel="nofollow noopener noreferrer"><strong>Abid Ali Awan</strong>'s</a>
<a href="https://www.kdnuggets.com/2021/12/tips-tricks-deploying-dl-webapps-heroku.html" target="_blank" rel="nofollow noopener noreferrer">article in KDNuggets</a>
guides you through creating a smooth process for deploying a deep learning web application
with Heroku. In the guide, he covers integration with DVC and optimizing storage
using Docker, Git & CLI-based deployment, how to deal with error code H10, and
tweaking Python packages to stay within the 500 MB Heroku limitation. If you've
been looking for a way to create a deep learning web app, this may help!</p>
<h2 id="amit-kulkarni---overview-of-mlops-with-open-source-tools" style="position:relative;">Amit Kulkarni - Overview of MLOps with Open Source Tools<a href="#amit-kulkarni---overview-of-mlops-with-open-source-tools" aria-label="amit kulkarni overview of mlops with open source tools permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In the very
<a href="https://www.analyticsvidhya.com/blog/2022/01/overview-of-mlops-with-open-source-tools/" target="_blank" rel="nofollow noopener noreferrer"><strong>FIRST</strong> tutorial of DVC Studio</a>
from the Community,
<a href="http://www.linkedin.com/in/amitvkulkarni2" target="_blank" rel="nofollow noopener noreferrer"><strong>Amit Kulkarni</strong></a> reviews the
setup process of DVC Studio and MLFlow and shows how they ease the operational
side of machine learning work by giving teams a clear way to track all the
factors that go into the iterative process.
Amit covers the easy setup process, adding a view, model comparison, and running
experiments from the DVC Studio UI.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/a4a6d7e537e16595bcd3e6f92afae851/39600/amit-kulkarni-studio.png" alt="DVC Studio Experiment Tracker UI" title="DVC Studio Experiment Tracker UI" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Amit Kulkarni's DVC Studio tutorial
(<a href="https://www.analyticsvidhya.com/blog/2022/01/overview-of-mlops-with-open-source-tools/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h1 id="github-goodness-" style="position:relative;">GitHub Goodness 😎<a href="#github-goodness-" aria-label="github goodness permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><img src="https://media.giphy.com/media/3ohzdIuqJoo8QdKlnW/giphy.gif" alt="Will Ferrell Reaction GIF"></p>
<p>In case you missed it, we now have an
<a href="https://github.com/iterative/awesome-iterative-projects" target="_blank" rel="nofollow noopener noreferrer">Awesome Iterative Projects Repository.</a>
This repository is a list of projects relying on Iterative tools to achieve
awesomeness. Recent additions to the list include:</p>
<ul>
<li><a href="https://github.com/zincware/ZnTrack" target="_blank" rel="nofollow noopener noreferrer">zincware/ZnTrack</a>: Create, visualize,
run & benchmark DVC pipelines in Python & Jupyter notebooks.</li>
<li><a href="https://github.com/gennaro-tedesco/nvim-dvc" target="_blank" rel="nofollow noopener noreferrer">nvim-dvc</a>: Neovim plugin for
DVC.</li>
</ul>
<p>We'd love to see more of the Community's awesome work added to this list. Feel
free to submit your project!</p>
<p>Other repos that came across my radar this last month that may be of interest to
our Community:</p>
<ul>
<li><a href="https://github.com/Nachimak28/awesome-list-of-awesomes" target="_blank" rel="nofollow noopener noreferrer">An Awesome List of Awesomes</a>:
an aggregation of all the Awesome lists</li>
<li><a href="https://github.com/visenger/awesome-mlops" target="_blank" rel="nofollow noopener noreferrer">Awesome MLOps</a>: an awesome list of
references for MLOps.</li>
<li><a href="https://github.com/mateuspicanco/project-atlas-sao-paulo" target="_blank" rel="nofollow noopener noreferrer">Project Atlas - São Paulo</a>:
a Data Science and Engineering initiative that aims to develop relevant and
curated Geospatial features of São Paulo, Brazil (includes DVC).</li>
<li><a href="https://github.com/lucmos/nn-template" target="_blank" rel="nofollow noopener noreferrer">NN Template</a>: Generic template to
bootstrap your PyTorch project (includes DVC)</li>
</ul>
<h1 id="deciding-on-mlops-tools" style="position:relative;">Deciding on MLOps tools?<a href="#deciding-on-mlops-tools" aria-label="deciding on mlops tools permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><img src="https://media.giphy.com/media/3ohjUZZEFfWJfaeKUE/giphy.gif" alt="Think Season 2 GIF by Portlandia"></p>
<p><a href="https://media.giphy.com/media/3ohjUZZEFfWJfaeKUE/giphy.gif" target="_blank" rel="nofollow noopener noreferrer">Last month</a> I told
you about Thoughtworks' guide to MLOps Platforms. If you prefer video content,
you may like
<a href="https://www.thoughtworks.com/what-we-do/data-and-ai/cd4ml/guide-to-evaluating-mlops-platforms1?utm_source=linkedin&utm_medium=social-organic&utm_campaign=tw-webinars_2021-12&gh_src=463a2f181us" target="_blank" rel="nofollow noopener noreferrer">this webinar</a>
from <a href="https://www.linkedin.com/in/ryan-dawson-501ab9123/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ryan Dawson</strong></a> on
CD4ML covering the process of identifying the best tools for your team's needs.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/69df8067190aee55adfb3b41c1bc2d0e/39600/ryan-dawson-thoughtworks-cd4ml.png" alt="MLOps tool evaluation process" title="MLOps tool evaluation process" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Ryan Dawson's MLOps tool evaluation process
(<a href="https://www.thoughtworks.com/what-we-do/data-and-ai/cd4ml/guide-to-evaluating-mlops-platforms1?utm_source=linkedin&utm_medium=social-organic&utm_campaign=tw-webinars_2021-12&gh_src=463a2f181us" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<p><a href="https://www.linkedin.com/in/deanpleban/" target="_blank" rel="nofollow noopener noreferrer"><strong>Dean Pleban</strong>,</a> CEO of
<a href="https://dagshub.com" target="_blank" rel="nofollow noopener noreferrer">DAGsHub,</a> also gave a great talk on a
decision-making framework for choosing your tools at
<a href="https://devopsdays.org/events/2021-tel-aviv/welcome/" target="_blank" rel="nofollow noopener noreferrer">DevOpsDays Tel Aviv</a>. In
this talk, you will learn guidelines and mental models that will help you choose
tools at whatever stage of the process you are in.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/XLc733qO2lE?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="in-other-data-science-and-ai-news" style="position:relative;">In Other Data Science and AI News<a href="#in-other-data-science-and-ai-news" aria-label="in other data science and ai news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="rob-toews-ai-predictions-in-forbes" style="position:relative;">Rob Toews AI Predictions in Forbes<a href="#rob-toews-ai-predictions-in-forbes" aria-label="rob toews ai predictions in forbes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://www.twitter.com/_RobToews" target="_blank" rel="nofollow noopener noreferrer"><strong>Rob Toews</strong></a> wrote
<a href="https://www.forbes.com/sites/robtoews/2021/12/22/10-ai-predictions-for-2022/?sh=559c4c8d482d" target="_blank" rel="nofollow noopener noreferrer">10 AI Predictions for 2022</a>
for <a href="https://forbes.com" target="_blank" rel="nofollow noopener noreferrer">Forbes.</a> In it, he predicts more startups getting funded
in NLP than in any other category, reinforcement learning becoming increasingly
important, the rise of synthetic data, and powerful new AI tools being built for
video. My favorite prediction:</p>
<blockquote>
<p>"'Responsible AI' will begin to shift from a vague catch-all term to an
operationalized set of enterprise practices."</p>
</blockquote>
<p>That's good news!</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.forbes.com/sites/robtoews/2021/12/22/10-ai-predictions-for-2022/?sh=559c4c8d482" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">10 AI Predictions for 2022</h4>
<div class="elp-description">Rob Toews predicts the rise of NLP, reinforcement learning, operationalized responsible AI and more.</div>
<div class="elp-link">https://forbes.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2022-01-18/forbes-8b4621c09667e17823a38d9a3b116086.jpeg" alt="10 AI Predictions for 2022">
</div>
</a>
</section>
<p></p>
<h3 id="chip-huyens-latest-blog-post" style="position:relative;">Chip Huyen's Latest Blog Post<a href="#chip-huyens-latest-blog-post" aria-label="chip huyens latest blog post permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You may remember <a href="https://twitter.com/chipro" target="_blank" rel="nofollow noopener noreferrer"><strong>Chip Huyen</strong></a> from
<a href="https://huyenchip.com/2020/12/30/mlops-v2.html" target="_blank" rel="nofollow noopener noreferrer">MLOps Tooling Landscape v2</a> and
<a href="https://docs.google.com/presentation/d/15ZrLFzimfy-8ob7mJ0qHPNyVoTtSfKBF5gPPG5f0Lz8/edit#slide=id.p" target="_blank" rel="nofollow noopener noreferrer">DVC's inclusion in her Machine Learning Systems Design Lecture series</a>.
At the turn of the new year, she published a new blog post entitled
<a href="https://huyenchip.com/2022/01/02/real-time-machine-learning-challenges-and-solutions.html" target="_blank" rel="nofollow noopener noreferrer">Real-time machine learning: Challenges and Solutions.</a>
The article describes what she learned from working with approximately 30 companies
in different industries doing real-time machine learning. She walks through the
online prediction processes of batch prediction and streaming prediction.</p>
<p>Additionally, she discusses continual learning, the difference between
stateless retraining (the model is trained from scratch each time) and stateful
training (the model continues training on new data), and moving from a manual
process to a more automated one. Definitely worth a read and we believe DVC and
CML can help you with your stateful training!</p>
<p>She and her team are running a <a href="https://forms.gle/dDvgF7QgpPdvJE5b8" target="_blank" rel="nofollow noopener noreferrer">survey</a> to
better understand the adoption and challenges of real-time ML. We encourage your
participation!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7ccdb3437ca25f7b5784bf6422f2a300/39600/stateful-training.png" alt="Stateful Training" title="Stateful Training" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Chip Huyen's Stateless vs. Stateful Training
(<a href="https://huyenchip.com/2022/01/02/real-time-machine-learning-challenges-and-solutions.html" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h3 id="vicki-boykis-top-three-fundamental-tools-for-a-machine-learning-engineer" style="position:relative;">Vicki Boykis' Top three Fundamental Tools for a Machine Learning Engineer<a href="#vicki-boykis-top-three-fundamental-tools-for-a-machine-learning-engineer" aria-label="vicki boykis top three fundamental tools for a machine learning engineer permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2bcc8468ee8185a0b26767d0b75e6526/03346/git-sql-cli.jpg" alt="Git, SQL, CLI" title="Git, SQL, CLI" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
If you're interested in becoming a machine learning engineer and you're not
familiar with <a href="https://twitter.com/vboykis" target="_blank" rel="nofollow noopener noreferrer"><strong>Vicki Boykis</strong>,</a> you should be.
She has an amazing blog with years of well-written, funny, technical content on
machine learning. Her latest piece entitled
<a href="https://vickiboykis.com/2022/01/09/git-sql-cli/" target="_blank" rel="nofollow noopener noreferrer">Git, SQL, CLI</a> explains why she
thinks these three tools are fundamental to any technical job. We think
so too.</p>
<h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="our-online-course-is-live-" style="position:relative;">Our Online Course is Live! 🎉<a href="#our-online-course-is-live-" aria-label="our online course is live permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>You can register for the FREE new course
<a href="https://learn.dvc.org" target="_blank" rel="nofollow noopener noreferrer">here on the Iterative website</a>. The course is currently
in beta mode. We already have some things we are working on to make it even
better, but we would love your feedback! 🙏🏼 So far we have had some minor
glitches and a lot of positive feedback! But we want your critiques too!</p>
<p><strong>Whoever can give us feedback on any three modules by February 6th will receive
some fresh new swag!</strong></p>
<p>We are already planning our next course!</p>
<h2 id="experiment-versioning-piece-in-kdnuggets" style="position:relative;">Experiment Versioning piece in KDNuggets<a href="#experiment-versioning-piece-in-kdnuggets" aria-label="experiment versioning piece in kdnuggets permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Our Senior Developer Advocate
<a href="https://twitter.com/mariaKhalusova" target="_blank" rel="nofollow noopener noreferrer"><strong>Maria Khalusova</strong></a> wrote a tutorial piece
on <code>exp init</code> and experiment versioning entitled
<a href="https://www.kdnuggets.com/2021/12/versioning-machine-learning-experiments-tracking.html" target="_blank" rel="nofollow noopener noreferrer">Versioning Machine Learning Experiments vs Tracking Them.</a>
The command helps you quickly set up a pipeline and codify your experiments with
all of the factors that contributed to each of them, including data, code,
pipeline, model version and all hyperparameters. This is a step above other
experiment tracking tools and enables you to achieve true reproducibility.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.kdnuggets.com/2021/12/versioning-machine-learning-experiments-tracking.html" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Versioning Machine Learning Experiments vs Tracking Them</h4>
<div class="elp-description">Maria Khalusova's tutorial on DVC's <code>exp init</code> command and the next level of experiment tracking that delivers true reproducibility.</div>
<div class="elp-link">https://kdnuggets.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2022-01-18/kdnuggets-1a388aac267d8ec89f41ff66516c76bc.jpeg" alt="Versioning Machine Learning Experiments vs Tracking Them">
</div>
</a>
</section>
<p></p>
<h2 id="new-team-members" style="position:relative;">New Team Members<a href="#new-team-members" aria-label="new team members permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We have a few new team members this month!</p>
<p><a href="https://github.com/dtrifiro" target="_blank" rel="nofollow noopener noreferrer"><strong>Daniele Trifirò</strong></a> is our first team member from
Italy! He joins us as a Senior Software Engineer. Daniele has a background in
Physics/Astrophysics, worked for 4 years as a researcher in the LIGO
Scientific Collaboration, and then went on to positions at Cloudian and illimity.
It was at illimity where he "fell in love" with DVC! In his free time Daniele
likes listening to and sometimes playing music himself, as well as rock
climbing. 🧗🏼♂️</p>
<p><a href="https://github.com/yathomasi" target="_blank" rel="nofollow noopener noreferrer"><strong>Thomas Kunwar</strong></a> is a software engineer joining
the team from Nepal. He's been working as a fullstack developer specializing in
the MERN stack and has led a team on multiple projects. In his free time Thomas
enjoys trekking, watching and playing sports, watching movies, and learning.
Welcome Thomas! 👏🏼</p>
<p><a href="https://github.com/madhur-tandon" target="_blank" rel="nofollow noopener noreferrer"><strong>Madhur Tandon</strong></a> joins our team as a
Software Engineer from Delhi, India. He is active in open source, and some of his
best-known contributions are to projects such as Pyodide (the Python Scientific
Stack compiled to WebAssembly) and JupyterLite (a Jupyter distribution running
in the browser). He has also been a speaker at PyData and JupyterCon. Talk to
him about his solo trip to SF, his experiences at Mozilla or about books, Indian
governance, food, and crypto. When not working, he is working out!💪🏼</p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Even with these amazing new additions to the team, we're still hiring!
<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a>
to find details of all the positions and share with anyone you think may be
interested! 🚀</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b575ad2dddeaef4c8a1e475f80cc5ca2/03346/hiring.jpg" alt="Iterative.ai is Hiring" title="Iterative.ai is Hiring" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Iterative is Hiring
(<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="upcoming-events" style="position:relative;">Upcoming Events<a href="#upcoming-events" aria-label="upcoming events permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="january-office-hours" style="position:relative;">January Office Hours!<a href="#january-office-hours" aria-label="january office hours permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Be sure to join us at the
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282663146/" target="_blank" rel="nofollow noopener noreferrer">January Office Hours Meetup,</a>
where <a href="https://www.linkedin.com/in/gennarotedesco/" target="_blank" rel="nofollow noopener noreferrer"><strong>Gennaro Tedesco</strong>,</a> Senior
Data Scientist at <a href="https://www.billie.io/" target="_blank" rel="nofollow noopener noreferrer">Billie.io,</a> will present his workflow
with DVC and CML. <a href="https://www.linkedin.com/in/tezan-sahu/" target="_blank" rel="nofollow noopener noreferrer"><strong>Tezan Sahu</strong></a>
will follow, presenting a workflow from his series of tutorials that we shared in the
<a href="https://dvc.org/blog/september-21-dvc-heartbeat" target="_blank" rel="nofollow noopener noreferrer">September Heartbeat,</a>
including DVC, PyCaret, MLFlow and FastAPI.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282663146/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">January Office Hours Meetup - 2 workflows</h4>
<div class="elp-description">RSVP for DVC Office Hours - 2 Workflows with integrations including Neovim, PyCaret, MLFlow and FastAPI!</div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-12-15/office-hours-meetup-07ea44242950433d0f1062e2bd5ef52f.png" alt="January Office Hours Meetup - 2 workflows">
</div>
</a>
</section>
<h3 id="milecia-mc-gregor-at-conf-42" style="position:relative;">Milecia McGregor at Conf42<a href="#milecia-mc-gregor-at-conf-42" aria-label="milecia mc gregor at conf 42 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 375px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e99cd124f97f8c84b053a3c79c40a84e/39600/Conf42.png" alt="Conf42" title="Milecia McGregor at Conf42" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Don't miss <a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a> at the
upcoming
<a href="https://www.conf42.com/Python_2022_Milecia_McGregor_reproducible_experiments_better_ml_models" target="_blank" rel="nofollow noopener noreferrer">Conf42</a>
on January 27th! She will be presenting her talk on Using Reproducible
Experiments To Create Better Machine Learning Models. If you haven't caught this
talk yet, now's the time!</p>
<hr>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/December-21-community-gemshttps://dvc.org/blog/December-21-community-gemsTue, 21 Dec 2021 00:00:00 GMT<h3 id="im-using-google-drive-as-a-remote-storage-and-accidentally-entered-the-verification-from-the-wrong-google-account-how-can-i-edit-that" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/908437162150739978" target="_blank" rel="nofollow noopener noreferrer">I'm using Google Drive as a remote storage and accidentally entered the verification from the wrong Google account. How can I edit that?</a><a href="#im-using-google-drive-as-a-remote-storage-and-accidentally-entered-the-verification-from-the-wrong-google-account-how-can-i-edit-that" aria-label="im using google drive as a remote storage and accidentally entered the verification from the wrong google account how can i edit that permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>No problem @fireballpoint1! This happens sometimes.</p>
<p>You should be able to run the following command in your terminal and then
re-enter your credentials.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">rm</span> .dvc/tmp/gdrive-user-credentials.json</span></code></pre></div>
<p>That should give you a chance to enter the correct credentials when you try to
<a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> again.</p>
<h3 id="can-i-add-a-dvc-remote-which-refers-to-nas-by-ip-so-i-dont-have-to-mount-on-every-computer" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/912667503283564544" target="_blank" rel="nofollow noopener noreferrer">Can I add a <code>dvc remote</code> which refers to NAS by IP so I don't have to mount on every computer?</a><a href="#can-i-add-a-dvc-remote-which-refers-to-nas-by-ip-so-i-dont-have-to-mount-on-every-computer" aria-label="can i add a dvc remote which refers to nas by ip so i dont have to mount on every computer permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>That's a new question for us @Krzysztof Begiedza!</p>
<p>If you enable the SSH service on your NAS, you can configure DVC to use it as an
SSH remote with <a href="https://dvc.org/doc/command-reference/remote/add"><code>dvc remote add</code></a>.</p>
<p>DSM (Synology DiskStation Manager) should also have packages for WebDAV,
if you prefer that over SSH. Just make sure that when you run
<a href="https://dvc.org/doc/command-reference/remote/add#-d"><code>dvc remote add -d storage &lt;URL&gt;</code></a>, your remote storage URL looks similar to
this.</p>
<div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">webdav://&lt;ip&gt;/&lt;path&gt;</code></pre></div>
<h3 id="can-you-selectively-dvc-pull-data-files" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/913713923667148850" target="_blank" rel="nofollow noopener noreferrer">Can you selectively <code>dvc pull</code> data files?</a><a href="#can-you-selectively-dvc-pull-data-files" aria-label="can you selectively dvc pull data files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Great question @Clemens!</p>
<p>You can run <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull &lt;file&gt;</code></a> to get only the files you want. You can also use
the <code>--glob</code> option on <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> so that DVC pulls only the targets that match a pattern.</p>
<p>The command for pulling a single file would be similar to this.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> path/to/specific/file</span></code></pre></div>
<p>You could also make a
<a href="https://dvc.org/doc/use-cases/data-registries" target="_blank" rel="nofollow noopener noreferrer">data registry</a> and use
<a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> in other projects to get a specific dataset. That way you don't
have to do a granular pull.</p>
<h3 id="what-is-the-fastest-way-to-get-the-specific-value-of-a-metric-of-an-experiment-based-on-experiment-id" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/916328260856590346" target="_blank" rel="nofollow noopener noreferrer">What is the fastest way to get the specific value of a metric of an experiment based on experiment id?</a><a href="#what-is-the-fastest-way-to-get-the-specific-value-of-a-metric-of-an-experiment-based-on-experiment-id" aria-label="what is the fastest way to get the specific value of a metric of an experiment based on experiment id permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>That's a really good question @Kwon-Young!</p>
<p>You can always browse experiment metrics with <a href="https://dvc.org/doc/command-reference/exp/show"><code>dvc exp show</code></a>, which
lists all of the experiments you've run.</p>
<p>To get the metrics for a specific experiment or set of experiments, you'll need
the experiment IDs. You can then use the Python API, as in this example.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>repo <span class="token keyword">import</span> Repo
dvc <span class="token operator">=</span> Repo<span class="token punctuation">(</span><span class="token string">"."</span><span class="token punctuation">)</span> <span class="token comment"># or Repo("path/to/repo/dir")</span>
metrics <span class="token operator">=</span> dvc<span class="token punctuation">.</span>metrics<span class="token punctuation">.</span>show<span class="token punctuation">(</span>revs<span class="token operator">=</span><span class="token punctuation">[</span><span class="token string">"exp-name1"</span><span class="token punctuation">,</span> <span class="token string">"exp-name2"</span><span class="token punctuation">,</span> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">]</span><span class="token punctuation">)</span></code></pre></div>
<p>This returns a Python dictionary with the same content that
<a href="https://dvc.org/doc/command-reference/metrics/show#--json"><code>dvc metrics show --json</code></a> displays, limited to the experiments you
specify.</p>
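<p>Once you have that dictionary, you still need to dig the metric you care about out of it. The sketch below is one hedged way to do that: it searches the nested dictionary recursively instead of assuming a fixed layout, since the exact nesting can differ between DVC versions. The helper name <code>find_metric</code> and the <code>sample</code> data are hypothetical and only illustrate the idea.</p>

```python
from typing import Any, Iterator, Tuple


def find_metric(tree: Any, name: str, path: Tuple[str, ...] = ()) -> Iterator[tuple]:
    """Yield (path, value) for every leaf key called `name` in a nested dict."""
    if isinstance(tree, dict):
        for key, value in tree.items():
            if key == name and not isinstance(value, dict):
                yield path + (key,), value
            else:
                # Keep descending; the metric may sit several levels down.
                yield from find_metric(value, name, path + (key,))


# Hypothetical stand-in for the dictionary returned by dvc.metrics.show();
# the real layout depends on your DVC version and metrics files.
sample = {
    "exp-name1": {"metrics.json": {"accuracy": 0.91, "loss": 0.30}},
    "exp-name2": {"metrics.json": {"accuracy": 0.88, "loss": 0.41}},
}

# The first path element is the revision (experiment) name.
accuracy_by_exp = {path[0]: value for path, value in find_metric(sample, "accuracy")}
print(accuracy_by_exp)
```

<p>Because the search does not depend on how deep the metric sits, the same helper should keep working if extra wrapper keys appear around the values.</p>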
<h3 id="is-it-possible-to-run-the-whole-pipeline-but-only-for-one-element-of-the-foreach" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/915986804577026088" target="_blank" rel="nofollow noopener noreferrer">Is it possible to run the whole pipeline but only for one element of the <code>foreach</code>?</a><a href="#is-it-possible-to-run-the-whole-pipeline-but-only-for-one-element-of-the-foreach" aria-label="is it possible to run the whole pipeline but only for one element of the foreach permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Another great question from @vgodie!</p>
<p>If your stages look something like this for example:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">cleanups</span><span class="token punctuation">:</span>
    <span class="token key atrule">foreach</span><span class="token punctuation">:</span> <span class="token comment"># List of simple values</span>
      <span class="token punctuation">-</span> raw1
      <span class="token punctuation">-</span> labels1
      <span class="token punctuation">-</span> raw2
    <span class="token key atrule">do</span><span class="token punctuation">:</span>
      <span class="token key atrule">cmd</span><span class="token punctuation">:</span> clean.py "$<span class="token punctuation">{</span>item<span class="token punctuation">}</span>"
      <span class="token key atrule">outs</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> $<span class="token punctuation">{</span>item<span class="token punctuation">}</span>.cln
  <span class="token key atrule">train</span><span class="token punctuation">:</span>
    <span class="token key atrule">foreach</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">epochs</span><span class="token punctuation">:</span> <span class="token number">3</span>
        <span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">10</span>
      <span class="token punctuation">-</span> <span class="token key atrule">epochs</span><span class="token punctuation">:</span> <span class="token number">10</span>
        <span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">15</span>
    <span class="token key atrule">do</span><span class="token punctuation">:</span>
      <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py $<span class="token punctuation">{</span>item.epochs<span class="token punctuation">}</span> $<span class="token punctuation">{</span>item.thresh<span class="token punctuation">}</span></code></pre></div>
<p>You should try the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> cleanups@labels1</span></code></pre></div>
<p>This reproduces only the <code>labels1</code> item of the <code>cleanups</code>
stage, together with any stages it depends on, rather than every item in the <code>foreach</code> list.</p>
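<p>To see which targets you can pass to <code>dvc repro</code>, it helps to know how a <code>foreach</code> list turns into stage names. The sketch below mimics the naming as we understand it: simple values become the suffix after <code>@</code>, while lists that contain dictionaries fall back to the item index. The function <code>expand_foreach</code> is a hypothetical helper, not part of DVC, and the naming rule is an assumption you should check against your own <code>dvc repro --dry</code> output.</p>

```python
from typing import List, Union


def expand_foreach(stage: str, items: Union[list, dict]) -> List[str]:
    """Return the stage names a foreach definition is assumed to generate."""
    if isinstance(items, dict):
        # A mapping uses its keys as suffixes.
        suffixes = list(items.keys())
    elif any(isinstance(item, (dict, list)) for item in items):
        # Lists holding nested values are addressed by position.
        suffixes = list(range(len(items)))
    else:
        # Lists of simple values are addressed by the value itself.
        suffixes = list(items)
    return [f"{stage}@{suffix}" for suffix in suffixes]


cleanup_targets = expand_foreach("cleanups", ["raw1", "labels1", "raw2"])
train_targets = expand_foreach(
    "train", [{"epochs": 3, "thresh": 10}, {"epochs": 10, "thresh": 15}]
)
print(cleanup_targets)
print(train_targets)
```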
<h3 id="is-it-possible-to-pull-experiments-from-the-remote-without-checking-out-the-base-commit-of-those-experiments" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/910481311905505290" target="_blank" rel="nofollow noopener noreferrer">Is it possible to pull experiments from the remote without checking out the base commit of those experiments?</a><a href="#is-it-possible-to-pull-experiments-from-the-remote-without-checking-out-the-base-commit-of-those-experiments" aria-label="is it possible to pull experiments from the remote without checking out the base commit of those experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for the question @mattlbeck!</p>
<p>You should be able to do this with <a href="https://dvc.org/doc/command-reference/exp/pull#-name"><code>dvc exp pull origin exp-name</code></a>.</p>
<p>If you have experiments with the same name on different commits,
<code>exp-name</code> alone is ambiguous: when names are duplicated, DVC selects
the experiment based on your current commit.</p>
<p>To work around this, you can use the full refname, like
<code>refs/exps/e7/78ad744e8d0cd59ddqc65d5d698cf102533f85/exp-6cb7</code>, to specify the
experiments that you want to work with.</p>
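<p>Judging from the example refname above, the path appears to consist of the base commit's hash split after its first two characters, followed by the experiment name. The sketch below builds such a string under that assumption. The helper <code>exp_refname</code> and the commit hash used here are hypothetical, so compare the result with the refs in your own repository before relying on it.</p>

```python
def exp_refname(baseline_sha: str, exp_name: str) -> str:
    """Build a full experiment refname from a base commit hash and a name.

    Assumes the layout refs/exps/<first 2 chars>/<remaining chars>/<name>.
    """
    if len(baseline_sha) != 40:
        raise ValueError("expected a full 40-character commit hash")
    return f"refs/exps/{baseline_sha[:2]}/{baseline_sha[2:]}/{exp_name}"


# Hypothetical base commit hash, used only for illustration.
ref = exp_refname("0123456789abcdef0123456789abcdef01234567", "exp-6cb7")
print(ref)
```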
<h3 id="how-should-i-handle-checkpoints-in-pytorch-lightning-with-dvclive" style="position:relative;"><a href="https://drive.google.com/file/d/1t0wPowk-PUinNjV4xchrzPZh7xsI8i37/view?usp=sharing" target="_blank" rel="nofollow noopener noreferrer">How should I handle checkpoints in PyTorch Lightning with DVCLive?</a><a href="#how-should-i-handle-checkpoints-in-pytorch-lightning-with-dvclive" aria-label="how should i handle checkpoints in pytorch lightning with dvclive permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is a really good question that came from one of our Office Hours talks!
Thanks <a href="https://www.linkedin.com/in/sirily/" target="_blank" rel="nofollow noopener noreferrer">Ilia Sirotkin</a>!</p>
<p>We have an <a href="https://github.com/iterative/dvclive/issues/170" target="_blank" rel="nofollow noopener noreferrer">open issue</a> we
encourage you to follow for more details and to even contribute!</p>
<p>PyTorch Lightning handles checkpoints differently from other libraries. This
affects the way metrics logging is executed and how models are saved.</p>
<p>The workaround we suggest is to write a custom callback that controls what gets
saved, and to track the result with DVC. You can implement the
<code>after_save_checkpoint</code> method and save the model file.</p>
<p>The way this works is by breaking your training process into small stages. You
should specify the stage’s checkpoint as the output of the stage and set it as a
dependency for the next stage. That way if something breaks, the <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>
command will resume your experiment from the last stage.</p>
<p>Your pipeline might look something like this:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">stage_0</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> checkpoints/checkpoint_epoch=0.ckpt
  <span class="token key atrule">next</span><span class="token punctuation">:</span>
    <span class="token key atrule">foreach</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">prev</span><span class="token punctuation">:</span> <span class="token number">0</span>
        <span class="token key atrule">next</span><span class="token punctuation">:</span> <span class="token number">1</span>
      <span class="token punctuation">-</span> <span class="token key atrule">prev</span><span class="token punctuation">:</span> <span class="token number">1</span>
        <span class="token key atrule">next</span><span class="token punctuation">:</span> <span class="token number">2</span>
    <span class="token key atrule">do</span><span class="token punctuation">:</span>
      <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py <span class="token punctuation">-</span><span class="token punctuation">-</span>checkpoint $<span class="token punctuation">{</span>item.prev<span class="token punctuation">}</span>
      <span class="token key atrule">deps</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> checkpoints/checkpoint_epoch=$<span class="token punctuation">{</span>item.prev<span class="token punctuation">}</span>.ckpt
      <span class="token key atrule">outs</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> checkpoints/checkpoint_epoch=$<span class="token punctuation">{</span>item.next<span class="token punctuation">}</span>.ckpt</code></pre></div>
<p>Then you'll need to reuse the <code>ModelCheckpoint</code> that is included in
<code>pytorch_lightning</code> to capture the checkpoints. Here's a snippet of what that
could look like in your training script:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> os
<span class="token keyword">import</span> pytorch_lightning <span class="token keyword">as</span> pl  <span class="token comment"># "pl" is the conventional alias</span>

<span class="token comment"># set checkpoint path</span>
ckpt_path <span class="token operator">=</span> os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>abspath<span class="token punctuation">(</span>os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>join<span class="token punctuation">(</span>os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>dirname<span class="token punctuation">(</span>__file__<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">"checkpoints"</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token comment"># checkpoints will be saved to checkpoints/checkpoint_epoch={epoch_number}.ckpt</span>
cp <span class="token operator">=</span> pl<span class="token punctuation">.</span>callbacks<span class="token punctuation">.</span>model_checkpoint<span class="token punctuation">.</span>ModelCheckpoint<span class="token punctuation">(</span>
    monitor<span class="token operator">=</span><span class="token string">"train_loss_epoch"</span><span class="token punctuation">,</span>
    save_top_k<span class="token operator">=</span><span class="token number">1</span><span class="token punctuation">,</span>
    dirpath<span class="token operator">=</span>ckpt_path<span class="token punctuation">,</span>
    filename<span class="token operator">=</span><span class="token string">'checkpoint_{epoch}'</span><span class="token punctuation">)</span></code></pre></div>
<h3 id="is-there-a-feature-for-dvc-to-only-sample-and-cache-a-subset-of-the-tracked-dataset-eg-10000-lines-of-a-large-file" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/917778575845900340" target="_blank" rel="nofollow noopener noreferrer">Is there a feature for DVC to only sample and cache a subset of the tracked dataset, e.g. 10000 lines of a large file?</a><a href="#is-there-a-feature-for-dvc-to-only-sample-and-cache-a-subset-of-the-tracked-dataset-eg-10000-lines-of-a-large-file" aria-label="is there a feature for dvc to only sample and cache a subset of the tracked dataset eg 10000 lines of a large file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Really great question @Abdi!</p>
<p>You should be able to use the streaming capability of the DVC API to achieve
this goal.</p>
<p>Here is an example of a Python script that would do this:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>api <span class="token keyword">import</span> <span class="token builtin">open</span> <span class="token keyword">as</span> dvcopen

<span class="token comment"># repo_url: URL or local path of the DVC project that tracks 'data'</span>
<span class="token keyword">with</span> dvcopen<span class="token punctuation">(</span><span class="token string">'data'</span><span class="token punctuation">,</span> repo_url<span class="token punctuation">)</span> <span class="token keyword">as</span> fd<span class="token punctuation">:</span>
    <span class="token keyword">for</span> line <span class="token keyword">in</span> fd<span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>line<span class="token punctuation">)</span></code></pre></div>
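<p>The loop above reads every line of the stream. To stop after a fixed number of lines, such as the 10000 mentioned in the question, you could wrap the file object in <code>itertools.islice</code>. The sketch below shows the idea on an in-memory file so that it runs without a DVC project. The helper name <code>head</code> is hypothetical, and we assume the object returned by <code>dvc.api.open</code> can be iterated line by line like a regular text file, as in the loop above.</p>

```python
import io
from itertools import islice
from typing import Iterable, List


def head(lines: Iterable[str], limit: int) -> List[str]:
    """Return at most `limit` lines without reading the rest of the stream."""
    return list(islice(lines, limit))


# Stand-in for the file object you would get from dvc.api.open(...).
fake_file = io.StringIO("".join(f"row {i}\n" for i in range(100)))

sample = head(fake_file, 10)
print(len(sample), sample[0].strip(), sample[-1].strip())
```

<p>Since <code>islice</code> stops pulling from the iterator once the limit is reached, the remaining lines are never consumed by the helper.</p>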
<hr>
<p><img src="https://media.giphy.com/media/h5Ct5uxV5RfwY/giphy.gif" alt="Done Tyler The Creator GIF"></p>
<p>At our January Office Hours Meetup we will be looking at machine learning
workflows and the Neovim-DVC plugin!
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282663146/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a>
to stay up to date with specifics as we get closer to the event!</p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and
CML questions answered!</p>https://dvc.org/blog/december-21-heartbeathttps://dvc.org/blog/december-21-heartbeatWed, 15 Dec 2021 00:00:00 GMT<h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>We've made it to the end of the year! 2021 has been an amazing journey for us
and our growing Community all over the world. There's lots of great news this
month. Let's not waste a heartbeat and get right to it! 😉</p>
<p><img src="https://media.giphy.com/media/YAIOuXv2zYDW8/giphy.gif" alt="Heartbeat!"></p>
<h2 id="dvc--cml--rasa--️" style="position:relative;">DVC + CML + RASA = ❤️<a href="#dvc--cml--rasa--%EF%B8%8F" aria-label="dvc cml rasa ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://twitter.com/m_a_upson" target="_blank" rel="nofollow noopener noreferrer"><strong>Matthew Upson</strong></a>, Founder at
<a href="https://mantisnlp.com/" target="_blank" rel="nofollow noopener noreferrer">MantisNLP,</a> an AI consultancy focused on NLP, along
with his team, put out the
<a href="https://medium.com/mantisnlp/mlops-for-conversational-ai-with-rasa-dvc-and-cml-part-i-beec756e8e7f" target="_blank" rel="nofollow noopener noreferrer">first blog post</a>
in a series showing how to use DVC and CML with Rasa to develop
conversational AI. This post sets the scene for the more detailed parts to
follow, and lays out how DVC is used to generate the DAG and log metrics
while CML does the testing. We're looking forward to the next installments!</p>
<p><img src="https://media.giphy.com/media/HYrBxW4xsPSP3wsUTk/giphy.gif" alt="Heartbeat!"></p>
<h2 id="curious-about-speaker-diarization" style="position:relative;">Curious about Speaker Diarization?<a href="#curious-about-speaker-diarization" aria-label="curious about speaker diarization permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://blogs.cisco.com/developer/speakerdiarization01" target="_blank" rel="nofollow noopener noreferrer">The co-authored article,</a>
"'Who Said That?' A Technical Intro to Speaker Diarization," by
<a href="https://www.linkedin.com/in/dariocazzani/" target="_blank" rel="nofollow noopener noreferrer"><strong>Dario Cazzani</strong></a> and
<a href="https://github.com/alhuang10" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Huang</strong></a>, machine learning engineers at
<a href="https://www.cisco.com/" target="_blank" rel="nofollow noopener noreferrer">Cisco,</a> introduces
Speaker Diarization: working out who spoke when in audio recordings. Their team's
solution takes you through the fingerprinting of voices, clustering to assign
speaker labels, creating the needed data pipeline, and the integration with
Webex.</p>
<p>In this process, the team benefits from using DVC to version data and
models, and to collaborate easily with each other and the transcription
team. More info on this project can be found
<a href="https://github.com/CiscoDevNet/vo-id#train-the-vectorizer" target="_blank" rel="nofollow noopener noreferrer">in their repo.</a></p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/eb6417a6abeb72008cb2c97e3cf72fad/39600/Dario-Cazzani-2.png" alt="Speaker Diarization" title="Speaker Diarization" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Dario Cazzani and team's process for assigning speaker labels to audio files
(<a href="https://blogs.cisco.com/developer/speakerdiarization01" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="using-dvc-in-academic-research-on-a-compartmental-infectious-disease-model" style="position:relative;">Using DVC in Academic Research on a Compartmental Infectious Disease Model<a href="#using-dvc-in-academic-research-on-a-compartmental-infectious-disease-model" aria-label="using dvc in academic research on a compartmental infectious disease model permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/matthew-segal-aa132093/" target="_blank" rel="nofollow noopener noreferrer"><strong>Matthew Segal</strong>,</a>
<a href="https://mattsegal.dev/devops-academic-research.html" target="_blank" rel="nofollow noopener noreferrer">in his post,</a> "DevOps in
Academic Research," reviews his work applying some tried-and-true DevOps
practices to data science projects that use a
<a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo" target="_blank" rel="nofollow noopener noreferrer">Markov chain Monte Carlo</a>
(MCMC) technique to create a model to simulate the spread of tuberculosis and
later, as the pandemic erupted, COVID-19.</p>
<p>The article covers mapping the workflow (see below), testing the codebase, smoke
tests
<a href="https://mattsegal.dev/pytest-on-github-actions.html" target="_blank" rel="nofollow noopener noreferrer">(with a guide link),</a>
continuous integration, and data management (where he recommends DVC).</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 500px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bf57bc814a51e86085b09a83a3717d48/39600/matt-segal.png" alt="Map Pipeline" title="Map Pipeline" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Working to
develop a pipeline
(<a href="https://mattsegal.dev/devops-academic-research.html" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="are-you-confused-by-how-many-mlops-tools-there-are" style="position:relative;">Are you confused by how many MLOps tools there are?<a href="#are-you-confused-by-how-many-mlops-tools-there-are" aria-label="are you confused by how many mlops tools there are permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><span class="gatsby-resp-image-wrapper image-wrap-right" style="position: relative; display: block; ; ; max-width: 450px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f9c0dab3e5e841de11778d0a64e7b89e/39600/thoughtworks-mlops-landscape.png" alt="Thoughtworks Triangle" title="Thoughtworks Platform vs. Specialist Triangle" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Well,
<a href="https://www.thoughtworks.com/?utm_source=google-search&utm_medium=paid-media&utm_campaign=always-on-brand_2021-11&utm_term=thoughtworks&utm_content=RSAad1&gclid=Cj0KCQiA2NaNBhDvARIsAEw55hg2li5srltu8ppVsxLzcnv-pYWRmvnCk_jmljiC2ocyM4tc7EUEt9gaAoVWEALw_wcB" target="_blank" rel="nofollow noopener noreferrer">Thoughtworks</a>
included DVC in its recent
<a href="https://www.thoughtworks.com/what-we-do/data-and-ai/cd4ml/guide-to-evaluating-mlops-platforms" target="_blank" rel="nofollow noopener noreferrer">Thoughtworks Guide to MLOps Platforms</a>.
While being included is great, things move so fast that they seem to have
missed our experiment capabilities and CML's CI/CD capabilities for machine
learning. 🤔</p>
<p>And if they only knew what's to come! 🚀 Lots planned in the new year!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 500px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ca7a536e93ac17fbb71c711a1dd1738f/c6e3d/more-tools.png" alt="They don't know DVC has more tools coming" title="They don't know DVC has more tools coming" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Dmitry Petrov's meme
(<a href="https://twitter.com/FullStackML/status/1465428233336201218?s=20" target="_blank" rel="nofollow noopener noreferrer">Source Link</a>)</em></p>
<h2 id="what-is-mlops---everything-you-must-know-to-get-started" style="position:relative;">What is MLOps - Everything You Must Know to Get Started<a href="#what-is-mlops---everything-you-must-know-to-get-started" aria-label="what is mlops everything you must know to get started permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In his post,
<a href="https://towardsdatascience.com/what-is-mlops-everything-you-must-know-to-get-started-523f2d0b8bd8" target="_blank" rel="nofollow noopener noreferrer">What is MLOps - Everything You Must Know to Get Started,</a>
<a href="https://www.linkedin.com/in/tyagiharshit/" target="_blank" rel="nofollow noopener noreferrer"><strong>Harshit Tyagi</strong></a> provides an
overview of MLOps and why it's necessary for taking today's ML and AI projects
to production. You will learn the different parts of the puzzle that make up MLOps,
and review the machine learning life cycle. In the post, Harshit also provides a
video of the concepts as well as an interview with our CEO,
<a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong>.</a> Be sure to check it out!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/84d1c5a8e04dc183f95961ae2cb797b9/03346/harshit-tyagi.jpg" alt="What is MLOps" title="What is MLOps" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Harshit Tyagi's ML Systems Engineering and Operations with their Stakeholders
(<a href="https://towardsdatascience.com/what-is-mlops-everything-you-must-know-to-get-started-523f2d0b8bd8" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="using-antipatterns-to-avoid-mlops-mistakes" style="position:relative;">Using AntiPatterns to avoid MLOps Mistakes<a href="#using-antipatterns-to-avoid-mlops-mistakes" aria-label="using antipatterns to avoid mlops mistakes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/nikhilmuralidhar/" target="_blank" rel="nofollow noopener noreferrer"><strong>Nikhil Muralidhar</strong>,</a> et al.,
in their survey paper,
<a href="https://arxiv.org/abs/2107.00079" target="_blank" rel="nofollow noopener noreferrer">Using AntiPatterns to avoid MLOps Mistakes,</a>
aim to develop a vocabulary for anti-patterns found in machine learning projects
in the financial services industry. In the paper, they also give recommendations
for achieving MLOps at an enterprise scale using processes for documentation and
management. Luckily, our tools help you to solve some of these challenges!</p>
<p>You can also catch Nikhil's interview with
<a href="https://twitter.com/bigdata" target="_blank" rel="nofollow noopener noreferrer"><strong>Ben Lorica</strong></a> from
<a href="https://thedataexchange.media/" target="_blank" rel="nofollow noopener noreferrer">The Data Exchange</a>
<a href="https://thedataexchange.media/mlops-anti-patterns/" target="_blank" rel="nofollow noopener noreferrer">podcast here.</a></p>
<p>
</p><section class="elp-content-holder">
<a href="https://arxiv.org/abs/2107.00079" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Using AntiPatterns to avoid MLOps Mistakes</h4>
<div class="elp-description">Nikhil Muralidhar et al.'s paper on AntiPatterns in MLOps in the financial services industry, with recommendations for improving machine learning operations.</div>
<div class="elp-link">https://arxiv.org</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-12-15/arxiv-9d99ec14ee87d2be7259ac0639bf93f9.png" alt="Using AntiPatterns to avoid MLOps Mistakes">
</div>
</a>
</section>
<p></p>
<h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="new-team-member" style="position:relative;">New Team Member<a href="#new-team-member" aria-label="new team member permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/iamritghimire/" target="_blank" rel="nofollow noopener noreferrer"><strong>Amrit Ghimire</strong></a> joins our Studio
team as a back-end developer from Pokhara, Nepal. Prior to joining Iterative,
he led a team at Leapfrog, Inc. to develop applications for a drug discovery
company. Amrit likes to read and watch movies in his free time and aims to
read 3-4 books per month. He also enjoys working in Python and Rust, and
customizing Linux systems and personal command-line automations. Welcome,
Amrit! 🎉</p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>As always, we're still hiring!
<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a>
to find details of all the positions including:</p>
<ul>
<li>Senior Software Engineer (ML, Labeling, Python)</li>
<li>Senior Frontend Engineer (TypeScript, Node, React)</li>
<li>Senior Software Engineer (ML, DevTools, Python)</li>
<li>Senior Software Engineer (ML, Data Infra, GoLang)</li>
<li>Field Data Scientist / Sales Engineer</li>
<li>Developer Advocate (Machine Learning)</li>
<li>Director / VP of Engineering (ML, DevTools)</li>
<li>Director / VP of Product (ML, Data Infra, SaaS)</li>
<li>Head of Talent</li>
<li>Head of DevRel</li>
<li>Account Executive (Sales)</li>
</ul>
<p>Please pass this info on to anyone you know that may fit the bill. Come join our
rocket ship! 🚀</p>
<p><img src="https://media.giphy.com/media/3xz2BzSNxkwPqF8Wdy/giphy.gif" alt="Go Team Nasa GIF"></p>
<h2 id="docs-updates" style="position:relative;">Docs Updates<a href="#docs-updates" aria-label="docs updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The DVC team has been steadily adding to the Experiment Management section of
our docs. We want to make sure that all your experiment versioning needs are met
and there's more to come! 🚀</p>
<p><img src="https://media.giphy.com/media/5qy3GWYwCydByEn3O6/giphy.gif" alt="Dvc GIF"></p>
<p>And don't miss
<a href="https://dvc.org/doc/use-cases/experiment-tracking" target="_blank" rel="nofollow noopener noreferrer">the latest Use Case on Machine Learning Experiment Tracking,</a>
which walks through the move from traditional, painful note-taking to more
advanced methods, and shows how DVC can take you to the next level!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7283f3d9051c553163d7643fb6e936f0/39600/natural-experimentation.png" alt="Machine Learning Experiment Tracking" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Tired of this? Check out our docs!
(<a href="https://dvc.org/doc/use-cases/experiment-tracking" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="dvc-online-course-update" style="position:relative;">DVC Online Course Update!<a href="#dvc-online-course-update" aria-label="dvc online course update permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The course is in editing mode and this week we are getting the second cuts for
review. The first course will focus on DVC for Data Scientists and Analysts. The
course is on track to be out by the end of the year! It will be 100% <strong>FREE</strong>
and available from our websites. We are so excited about how it's coming to
life! 🚀</p>
<p>👀 Note that the Udemy channel in Discord has now changed to
#iterative-online-course. We're getting ready!</p>
<p><img src="https://media.giphy.com/media/xUOxfh6ZM75efM3Bqo/giphy.gif" alt="You Can Do It GIF by chuber channel"></p>
<h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Be sure to join us at the
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282663146/" target="_blank" rel="nofollow noopener noreferrer">January Office Hours Meetup,</a>
where <a href="https://www.linkedin.com/in/gennarotedesco/" target="_blank" rel="nofollow noopener noreferrer"><strong>Gennaro Tedesco</strong>,</a> Senior
Data Scientist at <a href="https://www.billie.io/" target="_blank" rel="nofollow noopener noreferrer">Billie.io,</a> will present his workflow
with DVC and CML, and his Neovim-DVC plugin.
<a href="https://www.linkedin.com/in/tezan-sahu/" target="_blank" rel="nofollow noopener noreferrer"><strong>Tezan Sahu</strong></a> will follow,
presenting a workflow from his series of tutorials that we shared in the
<a href="https://dvc.org/blog/september-21-dvc-heartbeat" target="_blank" rel="nofollow noopener noreferrer">September Heartbeat,</a>
including DVC, PyCaret, MLFlow and FastAPI.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282663146/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">January Office Hours Meetup - 2 workflows</h4>
<div class="elp-description">RSVP for DVC Office Hours - 2 Workflows with integrations including Neovim, PyCaret, MLFlow and FastAPI!</div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-12-15/office-hours-meetup-07ea44242950433d0f1062e2bd5ef52f.png" alt="January Office Hours Meetup - 2 workflows">
</div>
</a>
</section>
<p></p>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>There were many candidates this month. Check out our Testimonials Wall of Love,
which is now live on our <a href="https://dvc.org/community" target="_blank" rel="nofollow noopener noreferrer">Community Page</a> and holds
many of our favorite Tweets! If you'd like to give a shout-out for our tools,
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">head here</a>
to make a video or written testimonial. We'd appreciate it! 🙏🏼</p>
<p>But for this month, this Tweet wins the coveted Tweet Love slot.</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Playing with Data Science Version control (DVC) from <a href="https://twitter.com/Iterativeai">@Iterativeai</a> - amazing how much it has progressed since I looked at it a couple of years ago</p>— Chris Samiullah (@ChrisSamiullah) <a href="https://twitter.com/ChrisSamiullah/status/1461702483965886468">November 19, 2021</a></blockquote>
<h2 id="thank-you" style="position:relative;">Thank you!<a href="#thank-you" aria-label="thank you permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>And with that we close out the year! We send a huge thank you to all of our
Community members that help us make our tools better. Thank you for your
contributions, trust and feedback! We look forward to continuing to grow with you
in 2022! Have a wonderful holiday season and Happy New Year! 🎉</p>
<hr>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/collaborative-experimentshttps://dvc.org/blog/collaborative-experimentsMon, 13 Dec 2021 00:00:00 GMT<h2 id="intro" style="position:relative;">Intro<a href="#intro" aria-label="intro permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Sharing experiments to compare machine learning models is important when you're
working with a team of engineers. You might need to get another opinion on an
experiment's results. You might need to share a modified dataset or even share
the exact reproduction of a specific experiment.</p>
<p>Setting up DVC remotes in addition to your Git remotes lets you share all of the
data, code, and hyperparameters associated with each experiment so anyone can
pick up where you left off in the training process. We'll go through an example
of sharing an experiment with DVC remotes.</p>
<h2 id="forking-the-project" style="position:relative;">Forking the project<a href="#forking-the-project" aria-label="forking the project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>To follow along, fork
<a href="https://github.com/iterative/example-dvc-experiments" target="_blank" rel="nofollow noopener noreferrer">this repo</a> as one of your
own GitHub repos. That way you'll have push access when we start working with
DVC. This repo has different tags that show the progression of the project and
you're welcome to check them out!</p>
<p>To get the branch we'll use in this post, you can run this command to clone your
forked repo. Make sure to replace <code>&lt;your_github&gt;</code> with your GitHub username.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git clone</span> git@github.com:<span class="token operator">&lt;</span>your_github<span class="token operator">&gt;</span>/example-dvc-experiments.git <span class="token parameter variable">-b</span> get-started</span></code></pre></div>
<p>This project already has DVC files set up to run experiments, but if you want to
follow along with a project you're currently working on, make sure to check out
the steps to initialize a DVC pipeline in
<a href="https://dvc.org/doc/start" target="_blank" rel="nofollow noopener noreferrer">the Getting Started doc</a>.</p>
<h2 id="setting-up-your-dvc-remotes" style="position:relative;">Setting up your DVC remotes<a href="#setting-up-your-dvc-remotes" aria-label="setting up your dvc remotes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>When you want to share the progress you've made with training your model, that
usually means you need to find a way to bundle the code, data, and
hyperparameters. This can be a complicated process if you're working with
gigabytes of data or a large number of hyperparameters.</p>
<p>That's one of the uses for DVC and why we'll be working with remotes. To start
with, make sure your GitHub remote is configured correctly. It should use the
SSH version of the URL so that DVC can authenticate the Git pushes and pulls
it performs as part of experiment sharing.</p>
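<p>If your clone was created with an HTTPS URL, the switch to SSH can be sketched
as follows. This is a minimal illustration, not part of Git or DVC: the
<code>to_ssh_url</code> helper and the <code>your_github</code> account name are
hypothetical, and it only handles URLs hosted on github.com.</p>

```shell
# Hypothetical helper (not part of Git or DVC): rewrite an HTTPS GitHub URL
# into the equivalent SSH form. URLs that are already SSH pass through unchanged.
to_ssh_url() {
  printf '%s\n' "$1" | sed -E 's#^https://github\.com/#git@github.com:#'
}

# Example with a placeholder account name.
to_ssh_url "https://github.com/your_github/example-dvc-experiments.git"
```

<p>Inside your clone, something like
<code>git remote set-url origin "$(to_ssh_url "$(git remote get-url origin)")"</code>
would apply the rewritten URL, and <code>git remote -v</code> lets you confirm the
result.</p>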
<p>The way DVC works is by storing custom Git refs in your repo with metadata that
defines the experiment. You can learn more about how DVC uses custom Git refs in
<a href="https://dvc.org/blog/experiment-refs" target="_blank" rel="nofollow noopener noreferrer">this post</a>.</p>
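<p>To get a feel for what a custom Git ref is, here is a small sketch that uses
plain Git in a throwaway repository. The ref path
<code>refs/exps/demo/exp-demo</code> is an assumption made up for this demo; the
linked post describes the layout DVC actually uses.</p>

```shell
# Throwaway repository so nothing in your real project is touched.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.name="Demo" -c user.email="demo@example.com" \
  commit -q --allow-empty -m "baseline"

# Create a custom ref outside refs/heads and refs/tags.
# The path is invented for illustration only.
git -C "$repo" update-ref refs/exps/demo/exp-demo HEAD

# The custom ref lives next to the branches but in its own namespace.
branches=$(git -C "$repo" for-each-ref --format='%(refname)' refs/heads)
exp_refs=$(git -C "$repo" for-each-ref --format='%(refname)' refs/exps)
printf 'branches: %s\nexperiment refs: %s\n' "$branches" "$exp_refs"
```

<p>Because such refs sit outside the branch and tag namespaces, they don't clutter
your branch list, and an ordinary push or fetch typically leaves them alone. That
is why dedicated commands are used to share experiments.</p>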
<p>Next, you'll need to set up a remote to your data location. This could be an AWS
S3 bucket, a Google Drive folder, or
<a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">one of the other supported storage types</a>.</p>
<p>An important thing to note about the project we're working with is that there is
already a remote set up for you to pull from. You can see this in <code>.dvc/config</code>.
You'll need to set up a separate remote to push changes to since this remote
doesn't allow push access.</p>
<p>For this example, we'll be using a Google Drive folder as the remote to handle
pushing data. Now that you know what we're doing, let's run the command to set
up the DVC remote to push to.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> cloud_remote gdrive://1k6aUYWphOulJlXgq4XbfKExWGyymTpEl</span></code></pre></div>
<p>This adds a remote storage location named <code>cloud_remote</code> to the DVC
configuration, so we'll be able to push and pull the exact code and data needed to
reproduce any experiment. With your Git remote and DVC remotes in place, you can
start pulling data and experiments from the cloud to your local machine.</p>
<p><em>Note: Make sure you have write permissions to the Git remote!</em></p>
<h2 id="listing-your-remote-experiments" style="position:relative;">Listing your remote experiments<a href="#listing-your-remote-experiments" aria-label="listing your remote experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>When you're working with a team on an existing project, you might want to see
the experiments already in the remotes so you know what's available. To take a
look at the experiments we have run in the repo you forked, you'll have to set
up a new Git upstream remote to reference the original repo. You can do that with
the following command.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git remote add</span> upstream https://github.com/iterative/example-dvc-experiments</span></code></pre></div>
<p>Now you can take a look at all of the experiments we have associated with this
repo with the following command.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp list</span> upstream <span class="token parameter variable">--all</span></span></code></pre></div>
<p>You'll get a list of all of the experiments across different Git branches that
have been pushed with DVC in the original repo. The output will look similar to
this.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">21784fa:
exp-c8dcf
main:
exp-b3667
exp-d382a</code></pre></div>
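<p>The listing groups experiment names under the Git revision they were based on.
As a minimal sketch, assuming output shaped like the sample above (a line ending
in a colon names the baseline, every other line names an experiment), the
following flattens it into baseline and experiment pairs with <code>awk</code>.
The variable names are ours, and the sample text is copied from the listing
rather than produced by running DVC.</p>

```shell
# Sample text copied from the listing above (not a live DVC call).
list_output='21784fa:
exp-c8dcf
main:
exp-b3667
exp-d382a'

# Turn the grouped listing into "baseline<TAB>experiment" pairs.
pairs=$(printf '%s\n' "$list_output" | awk '
  /:$/ { sub(/:$/, ""); base = $0; next }          # baseline line, e.g. "main:"
  NF   { gsub(/^[ \t]+/, ""); print base "\t" $0 } # experiment line
')
printf '%s\n' "$pairs"
```

<p>This prints three tab-separated pairs, one per experiment, which is a handy
shape if you want to loop over experiment names in a script.</p>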
<p>Now you'll be able to pick which experiment you want to reproduce and start
testing with.</p>
<h2 id="pulling-experiments" style="position:relative;">Pulling experiments<a href="#pulling-experiments" aria-label="pulling experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>If you're picking up an existing project, there will likely be a specific
experiment you'll get started with. To pull an experiment to your local machine,
you'll need the experiment's name for the following command.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp pull</span> upstream exp-b3667</span></code></pre></div>
<p>The <code>exp-b3667</code> name comes from the output of the <a href="https://dvc.org/doc/command-reference/exp/list"><code>dvc exp list</code></a> command we ran earlier.
Once the pull finishes, you have all of the data and code associated with that experiment on your machine.</p>
<p>From here, you can start running new experiments with different models,
hyperparameters, or even datasets.</p>
<h2 id="pushing-experiments" style="position:relative;">Pushing experiments<a href="#pushing-experiments" aria-label="pushing experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Once you're done with your new experiments, you can push these to the Google
Drive remote we set up earlier. DVC handles both the GitHub and data storage
pushes with this command.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp push</span> origin exp-p4202</span></code></pre></div>
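<p>This pushes the custom Git refs to your forked repo (<code>origin</code>) and
any artifacts, like your data or model output, to the DVC remote location.
Replace <code>exp-p4202</code> with the name of your own experiment. Because the
project's default DVC remote doesn't allow pushes, you may need to add
<code>-r cloud_remote</code> so the data goes to the remote you set up. If you
have checkpoints enabled, it will also push the checkpoints of an experiment.
Now you can easily share your work with other engineers to get feedback faster
and finish projects sooner.</p>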
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>It's a lot easier to get help from someone on a project when you can share
everything with them. When you use DVC, you can bundle your data and code
changes for each experiment and push those to a remote for somebody else to
check out.</p>https://dvc.org/blog/ml-experiment-versioninghttps://dvc.org/blog/ml-experiment-versioningTue, 07 Dec 2021 00:00:00 GMT<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/z0s42TxH9oM?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></div>
<p>Experiment tracking tools help manage machine learning projects where version
control tools like Git aren't enough. They log parameters and metrics, and they
store artifacts like input data or model weights, so that you can reproduce
experiments and retrieve results. They also provide a dashboard to navigate all
this meta-information across lots of experiments.</p>
<p>Git can't manage or compare all that experiment meta-information, but it is
still better for code. Tools like GitHub make distributed collaboration easy,
and you can see incremental code changes. That's why experiments get split
between Git for code and experiment tracking tools for meta-information (usually
with a link in one or the other to keep track).</p>
<p>ML experiment versioning combines experiment tracking and version control.
Instead of managing these separately, keep everything in one place and get the
benefits of both, like:</p>
<ul>
<li><strong>Experiments as code</strong>: Track meta-information in the repository and version
it like code.</li>
<li><strong>Versioned reproducibility</strong>: Save and restore experiment state, and track
changes to only execute what's new.</li>
<li><strong>Distributed experiments</strong>: Organize locally and choose what to share,
reusing your existing repo setup.</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 537px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4f65fcb9a5c8d32158b5283122c9dd10/39600/exp-versioning.png" alt="Experiment Versioning" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h1 id="ml-experiments-as-code" style="position:relative;">ML Experiments as Code<a href="#ml-experiments-as-code" aria-label="ml experiments as code permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Experiment versioning treats experiments as code. It saves all metrics,
hyperparameters, and artifact information in text files that can be versioned by
Git (DVC <a href="https://dvc.org/doc/start/data-and-model-versioning" target="_blank" rel="nofollow noopener noreferrer">data versioning</a>
backs up the artifacts themselves anywhere). You do not need a centralized
database or online services. Git becomes a store for experiment
meta-information.</p>
<p>You can choose your own file formats and paths, which you can configure in DVC:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp init</span> <span class="token parameter variable">-i</span>
</span>This command will guide you to set up a default stage in dvc.yaml.
See https://dvc.org/doc/user-guide/project-structure/pipelines-files.
DVC assumes the following workspace structure:
├── data
├── metrics.json
├── models
├── params.yaml
├── plots
└── src
Command to execute: python src/train.py
Path to a code file/directory [src, n to omit]: src/train.py
Path to a data file/directory [data, n to omit]: data/images/
Path to a model file/directory [models, n to omit]:
Path to a parameters file [params.yaml, n to omit]:
Path to a metrics file [metrics.json, n to omit]:
Path to a plots file/directory [plots, n to omit]: logs.csv</code></pre></div>
<p>Once you set up your repo in this structure, you start to see the benefits of
this approach. Experiment meta-information lives in readable files that are
always available, and your code can stay clean. You can read, save, and version
your meta-information:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> params.yaml
</span>train:
  epochs: 10
model:
  conv_units: 128</code></pre></div>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> metrics.json
</span>{"loss": 0.24310708045959473, "acc": 0.9182999730110168}</code></pre></div>
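<p>Because the metrics live in a plain JSON file, any script can read them without a tracking server. The snippet below is a minimal sketch, not part of DVC: the helper names <code>load_metrics</code> and <code>meets_threshold</code> and the 0.9 accuracy threshold are assumptions made for illustration, and the file is written to a temporary directory so the example is self-contained.</p>

```python
import json
import tempfile
from pathlib import Path


def load_metrics(path):
    """Read a flat JSON metrics file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def meets_threshold(metrics, name, minimum):
    """Return True when the named metric exists and is at least `minimum`."""
    return name in metrics and metrics[name] >= minimum


# Write a sample metrics.json (same values as above) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    metrics_path = Path(tmp) / "metrics.json"
    metrics_path.write_text(
        '{"loss": 0.24310708045959473, "acc": 0.9182999730110168}',
        encoding="utf-8",
    )
    metrics = load_metrics(metrics_path)

# Hypothetical quality gate: accept the run only if accuracy is >= 0.9.
accurate_enough = meets_threshold(metrics, "acc", 0.9)
```

<p>A check like this could run in CI before an experiment is shared, since it only needs the files already stored in the repository.</p>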
<p>You can see what changed in parameters, code, or anything else:</p>
<div class="gatsby-highlight" data-language="diff"><pre class="language-diff"><code class="language-diff">$ git diff HEAD~1 -- params.yaml
diff --git a/params.yaml b/params.yaml
index baad571a2..57d098495 100644
<span class="token coord">--- a/params.yaml</span>
<span class="token coord">+++ b/params.yaml</span>
<span class="token coord">@@ -1,5 +1,5 @@</span>
<span class="token unchanged"><span class="token prefix unchanged"> </span>train:
<span class="token prefix unchanged"> </span>  epochs: 10
</span><span class="token deleted-sign deleted"><span class="token prefix deleted">-</span>model:
<span class="token prefix deleted">-</span>  conv_units: 16
</span><span class="token inserted-sign inserted"><span class="token prefix inserted">+</span>model:
<span class="token prefix inserted">+</span>  conv_units: 128</span></code></pre></div>
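<p>The same comparison can be made programmatically once the parameters are loaded into dictionaries. This is only an illustrative sketch: <code>flatten</code> and <code>diff_params</code> are hypothetical helpers, not DVC functions, and the two dictionaries stand in for the old and new contents of <code>params.yaml</code>. Nested keys are joined with dots, matching names like <code>model.conv_units</code>.</p>

```python
def flatten(params, prefix=""):
    """Flatten nested dictionaries into dotted names such as 'model.conv_units'."""
    flat = {}
    for key, value in params.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def diff_params(old, new):
    """Map each changed parameter name to its (old, new) pair of values."""
    old_flat, new_flat = flatten(old), flatten(new)
    changes = {}
    for name in sorted(set(old_flat) | set(new_flat)):
        before, after = old_flat.get(name), new_flat.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


# Stand-ins for the previous and current contents of params.yaml.
old_params = {"train": {"epochs": 10}, "model": {"conv_units": 16}}
new_params = {"train": {"epochs": 10}, "model": {"conv_units": 128}}

changes = diff_params(old_params, new_params)
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")
```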
<p>With DVC, you can even compare lots of experiments from the terminal like you
would in a dashboard:</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"><span class="token rows">$ dvc exp show
</span> ─────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Created<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.epochs<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>model.conv_units<span class="token hide">**</span></span></span>
</span> ─────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.25183<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.9137<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>10<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>64<span class="token hide">**</span></span>
<span class="token bold"><span class="token hide">**</span>mybranch<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>Oct 23, 2021<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>10<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>16<span class="token hide">**</span></span>
├── 9a4ff1c <span class="token bold"><span class="token hide">**</span>[exp-333c9]<span class="token hide">**</span></span> 10:40 AM 0.25183 0.9137 10 64
├── 138e6ea <span class="token bold"><span class="token hide">**</span>[exp-55e90]<span class="token hide">**</span></span> 10:28 AM 0.25784 0.9084 10 32
└── 51b0324 <span class="token bold"><span class="token hide">**</span>[exp-2b728]<span class="token hide">**</span></span> 10:17 AM 0.25829 0.9058 10 16
</span> ─────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div>
<h1 id="versioned-reproducibility" style="position:relative;">Versioned reproducibility<a href="#versioned-reproducibility" aria-label="versioned reproducibility permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>One reason you need to track all this meta-information is to reproduce your
experiment. Experiment tracking databases save the artifacts, but you still need
to put them all back in the right place. Since experiment versioning keeps all
the meta-information in your repo, you can restore the experiment state exactly
as it was in your workspace. DVC
<a href="https://dvc.org/blog/experiment-refs" target="_blank" rel="nofollow noopener noreferrer">saves the state of the experiment</a>, and
it can restore it for you:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp apply</span> exp-333c9
</span>
Changes for experiment 'exp-333c9' have been applied to your current workspace.</code></pre></div>
<p>Reproducibility is nice, but data drift, new business requirements, bug fixes,
etc. all mean running a slightly modified experiment. You don't have time to
always start from scratch. Versioned reproducibility means tracking changes to
the experiment state. DVC can determine what changes were introduced by the
experiment and only run what's necessary. It only saves those changes, so you
don't waste time or storage on duplicate copies of data.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">model.conv_units</span><span class="token operator">=</span><span class="token number">128</span>
</span>'data/images.tar.gz.dvc' didn't change, skipping
Stage 'extract' didn't change, skipping
Running stage 'train':
> python3 src/train.py
79/79 [==============================] - 1s 14ms/step - loss: 0.2552 - acc: 0.9180
Updating lock file 'dvc.lock'
Reproduced experiment(s): exp-be916
Experiment results have been applied to your workspace.
To promote an experiment to a Git branch run:
dvc exp branch <exp> <branch></code></pre></div>
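<p>The idea of skipping unchanged stages can be illustrated with content hashes. The sketch below is a simplification and not DVC's actual implementation: it assumes a stage is rerun only when the hash of one of its dependencies differs from the hash recorded after the previous run. The function names and the in-memory <code>lock</code> dictionary (standing in for a lock file) are hypothetical.</p>

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def stage_changed(deps, recorded):
    """Compare current dependency hashes with the recorded ones.

    Returns (changed, current_hashes) so the caller can store the new hashes.
    """
    current = {str(path): file_hash(path) for path in deps}
    return current != recorded, current


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "train.py"
    script.write_text("print('v1')\n", encoding="utf-8")

    # First run: nothing recorded yet, so the stage must run.
    first_run_changed, lock = stage_changed([script], {})

    # Second run with identical inputs: the stage can be skipped.
    second_run_changed, _ = stage_changed([script], lock)

    # Editing a dependency changes its hash, so the stage runs again.
    script.write_text("print('v2')\n", encoding="utf-8")
    third_run_changed, _ = stage_changed([script], lock)
```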
<h1 id="distributed-experiments" style="position:relative;">Distributed Experiments<a href="#distributed-experiments" aria-label="distributed experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Experiment tracking tools log experiments to a central database and show them in
a dashboard. This makes it easy to share them with teammates and compare
experiments. However, it introduces a problem: in an active experimentation
phase, you may create hundreds of experiments. Team members may be overwhelmed,
and the tool loses one of its core purposes, which is sharing experiments
between team members.</p>
<p>Experiment versioning piggybacks on Git and its distributed nature. All the
experiments you run are stored in your local repo, and only the best experiments
are promoted to the central repo (GitHub for example) to share with teammates.
Distributed experiments are shared with the same people as your code repo, so
you don't need to replicate your project permissions or worry about security
risks.</p>
<p>With DVC, you can push experiments just like Git branches, giving you
flexibility to share whatever, whenever, and wherever you choose:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp push</span> origin exp-333c9
</span>Pushed experiment 'exp-333c9' to Git remote 'origin'.</code></pre></div>
<h1 id="what-next" style="position:relative;">What Next?<a href="#what-next" aria-label="what next permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>These enhancements can have powerful ripple effects for fast-moving, complex,
collaborative ML projects. There are parallels to the
<a href="https://ericsink.com/vcbe/html/history_of_version_control.html" target="_blank" rel="nofollow noopener noreferrer">history of version control</a>.
Git's distributed nature and incremental change tracking were major advances
over the centralized, file-based version control systems of previous
generations. Experiment versioning can similarly advance the next generation of
experiment tracking.</p>
<p>ML experiment versioning is still in its early days. Look out for future
announcements about:</p>
<ul>
<li>Deep learning features like <a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">live monitoring</a> and
<a href="https://dvc.org/doc/user-guide/experiment-management/checkpoints" target="_blank" rel="nofollow noopener noreferrer">checkpointing</a>.</li>
<li>Visualizing and comparing experiment results in other tools like VS Code and
<a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">DVC Studio</a>.</li>
</ul>
<p>What do you want to see for the next generation of experiment tracking? Join our
upcoming
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282064369/" target="_blank" rel="nofollow noopener noreferrer">meetup</a>
to discuss, join our <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord community</a>, or
let us know in the comments!</p>https://dvc.org/blog/november-21-community-gemshttps://dvc.org/blog/november-21-community-gemsTue, 30 Nov 2021 00:00:00 GMT<h3 id="what-would-be-the-cleanest-most-pythonic-way-to-run-dvc-commands-from-inside-a-python-script-if-we-want-to-avoid-calling-the-subprocess-library" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/895570704605528094" target="_blank" rel="nofollow noopener noreferrer">What would be the cleanest, most Pythonic way to run DVC commands from inside a Python script if we want to avoid calling the subprocess library?</a><a href="#what-would-be-the-cleanest-most-pythonic-way-to-run-dvc-commands-from-inside-a-python-script-if-we-want-to-avoid-calling-the-subprocess-library" aria-label="what would be the cleanest most pythonic way to run dvc commands from inside a python script if we want to avoid calling the subprocess library permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>That's a really good question @mihaj!</p>
<p>If you want to run DVC commands in a Python script, you have a couple of
options.</p>
<p>You can work with the <code>main</code> module from the <code>dvc</code> library. This is the more
CLI-like option. An example of running an experiment would look something like
this.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>main <span class="token keyword">import</span> main
main<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"exp"</span><span class="token punctuation">,</span> <span class="token string">"run"</span><span class="token punctuation">]</span><span class="token punctuation">)</span></code></pre></div>
<p>The other option you have is to use the <code>Repo API</code>. This API is largely
undocumented at the moment, but it closely mirrors the CLI commands. One
exception is that its methods return internal data structures instead of exit
codes.</p>
<p>Here's an example of running an experiment with the Repo API.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>repo <span class="token keyword">import</span> Repo
repo <span class="token operator">=</span> Repo<span class="token punctuation">(</span><span class="token punctuation">)</span>
repo<span class="token punctuation">.</span>experiments<span class="token punctuation">.</span>run<span class="token punctuation">(</span><span class="token punctuation">)</span>
repo<span class="token punctuation">.</span>experiments<span class="token punctuation">.</span>show<span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token comment"># etc...</span></code></pre></div>
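<p>Since the CLI-like option reports success or failure through an exit code, it can help to wrap it so that failures raise an exception. The following is a minimal sketch under that assumption: <code>run_dvc</code> and <code>DvcCommandError</code> are hypothetical names, and <code>fake_main</code> is a stand-in for the real entry point so the example runs without DVC installed.</p>

```python
class DvcCommandError(RuntimeError):
    """Raised when a DVC command finishes with a non-zero exit code."""


def run_dvc(args, entrypoint):
    """Call a CLI-style entry point and raise if it reports failure.

    `entrypoint` is any callable that takes a list of arguments and
    returns an integer exit code.
    """
    exit_code = entrypoint(list(args))
    if exit_code != 0:
        raise DvcCommandError(f"dvc {' '.join(args)} exited with code {exit_code}")
    return exit_code


def fake_main(argv):
    """Hypothetical stand-in: pretend only 'exp run' succeeds."""
    return 0 if argv[:2] == ["exp", "run"] else 1


result = run_dvc(["exp", "run"], fake_main)

try:
    run_dvc(["exp", "unknown"], fake_main)
    failure_message = None
except DvcCommandError as error:
    failure_message = str(error)
```

<p>In a real project you would pass the <code>main</code> function imported from the <code>dvc</code> library in place of <code>fake_main</code>.</p>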
<h3 id="how-can-you-check-if-a-dvc-tracked-directory-has-changes" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/899693929560158218" target="_blank" rel="nofollow noopener noreferrer">How can you check if a DVC tracked directory has changes?</a><a href="#how-can-you-check-if-a-dvc-tracked-directory-has-changes" aria-label="how can you check if a dvc tracked directory has changes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Good question from @edran!</p>
<p>You can check which directories have been changed by running:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc status</span></span></code></pre></div>
<p>This will give you an output similar to this in your terminal:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train</span><span class="token punctuation">:</span>
  <span class="token key atrule">changed deps</span><span class="token punctuation">:</span>
    <span class="token key atrule">modified</span><span class="token punctuation">:</span> src/train.py
  <span class="token key atrule">changed outs</span><span class="token punctuation">:</span>
    <span class="token key atrule">deleted</span><span class="token punctuation">:</span> model.pkl
<span class="token key atrule">evaluate</span><span class="token punctuation">:</span>
  <span class="token key atrule">changed deps</span><span class="token punctuation">:</span>
    <span class="token key atrule">deleted</span><span class="token punctuation">:</span> model.pkl</code></pre></div>
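<p>If you need this information in a script, a status report of this shape is easy to flatten into rows. The sketch below assumes the report has already been loaded into a dictionary that maps each stage to a list of change groups; that structure is an assumption modeled on the output above, and <code>changed_paths</code> is a hypothetical helper rather than a DVC function.</p>

```python
def changed_paths(status):
    """Flatten a status report into (stage, kind, path, state) tuples."""
    rows = []
    for stage, entries in status.items():
        for entry in entries:
            # Skip anything that is not a {"changed ...": {...}} group.
            if not isinstance(entry, dict):
                continue
            for kind, paths in entry.items():
                for path, state in paths.items():
                    rows.append((stage, kind, path, state))
    return rows


# Assumed structure, mirroring the terminal output shown above.
status = {
    "train": [
        {"changed deps": {"src/train.py": "modified"}},
        {"changed outs": {"model.pkl": "deleted"}},
    ],
    "evaluate": [
        {"changed deps": {"model.pkl": "deleted"}},
    ],
}

rows = changed_paths(status)
for stage, kind, path, state in rows:
    print(f"{stage}: {kind}: {path} ({state})")
```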
<p>We're working on adding granularity support for this command and should have a
release for this in the next few months.</p>
<h3 id="is-there-a-way-to-look-at-all-of-the-experiments-ive-run-and-see-the-metrics-and-parameters-associated-with-them" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/900451895666155520" target="_blank" rel="nofollow noopener noreferrer">Is there a way to look at all of the experiments I've run and see the metrics and parameters associated with them?</a><a href="#is-there-a-way-to-look-at-all-of-the-experiments-ive-run-and-see-the-metrics-and-parameters-associated-with-them" aria-label="is there a way to look at all of the experiments ive run and see the metrics and parameters associated with them permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for asking @GuyAR! This is a common question that comes up.</p>
<p>You can see all of your experiments and the associated metrics and parameters in
a table in the terminal by running the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span></span></code></pre></div>
<p>This prints a table similar to the one below, with the metrics and parameters
for each experiment.</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span>
</span> ────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>3<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.91389<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.87<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.20506<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.66306<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span>
<span class="token bold"><span class="token hide">**</span>data-change<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span>
│ ╓ 9405575 [exp-54e8a] 3 0.91389 0.87 0.20506 0.66306 0.001 0.09
│ ╟ 856d80f 2 0.90215 0.87333 0.27204 0.61631 0.001 0.09
│ ╟ 23dc98f 1 0.87671 0.86 0.35964 0.61713 0.001 0.09
├─╨ 99a3c34 0 0.71429 0.82 0.67674 0.62798 0.001 0.09
│ ╓ 3b3a2a2 [exp-23593] 3 0.86885 0.46 0.31573 3.7067 0.001 0.09
│ ╟ 93d015d 2 0.83197 0.41333 0.36851 3.4259 0.001 0.09
│ ╟ d474c42 1 0.79918 0.43333 0.46612 3.286 0.001 0.09
├─╨ 1582b4b 0 0.52869 0.39 0.94102 2.5967 0.001 0.09
</span> ────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div>
<h3 id="whats-the-recommended-way-to-remove-data-that-has-been-imported-using-dvc-import" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/898462029650735134" target="_blank" rel="nofollow noopener noreferrer">What's the recommended way to remove data that has been imported using <code>dvc import</code>?</a><a href="#whats-the-recommended-way-to-remove-data-that-has-been-imported-using-dvc-import" aria-label="whats the recommended way to remove data that has been imported using dvc import permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Great question @MadsO!</p>
<p>This works exactly the same way as when you've added data with <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>. To remove the
data, pass the <code>.dvc</code> file that the import created (a hypothetical <code>data.dvc</code> here) to this command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remove</span> data.dvc</span></code></pre></div>
<h3 id="when-using-a-cml-are-github-actions-gitlab-and-bitbucket-the-only-options-for-ci" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/909847110306914345" target="_blank" rel="nofollow noopener noreferrer">When using a CML, are GitHub Actions, GitLab, and BitBucket the only options for CI?</a><a href="#when-using-a-cml-are-github-actions-gitlab-and-bitbucket-the-only-options-for-ci" aria-label="when using a cml are github actions gitlab and bitbucket the only options for ci permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Currently, <code>cml runner</code> does not support CircleCI or Drone CI self-hosted runners,
so you would have to deploy them manually.</p>
<p>You can still use <code>cml send-comment</code>, <code>cml pr</code>, and the other CML tools with any
CI platform.</p>
<p>Thanks for this awesome question @tpietruszka!</p>
<h3 id="when-i-run-the-dvc-remove-command-does-it-only-remove-dvc-files" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/905382438786715648" target="_blank" rel="nofollow noopener noreferrer">When I run the <code>dvc remove</code> command, does it only remove <code>.dvc</code> files?</a><a href="#when-i-run-the-dvc-remove-command-does-it-only-remove-dvc-files" aria-label="when i run the dvc remove command does it only remove dvc files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>A really good question from @flowy!</p>
<p>That is correct. By default, <a href="https://dvc.org/doc/command-reference/remove"><code>dvc remove</code></a> only stops tracking the target: it deletes the <code>.dvc</code> file,
or the stage entry in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, and removes the matching entry from <code>.gitignore</code>.
The data itself stays in your workspace.</p>
<p>For example, if you run something like <a href="https://dvc.org/doc/command-reference/remove"><code>dvc remove folder_name/file.dvc</code></a>, only
the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file will be removed. So your updated directory would likely still
have <code>folder_name/file</code> since that was the file being tracked.</p>
<p>If you wanted to remove the tracked file as well, you would need to run
<a href="https://dvc.org/doc/command-reference/remove#--outs"><code>dvc remove --outs</code></a>. This command removes the outputs of any target.</p>
<p>If there is nothing else in the folder, you'll be left with an empty directory.
You can remove it and stop tracking in Git with a command like:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> <span class="token function">rm</span> <span class="token parameter variable">-r</span> folder_name</span></code></pre></div>
<h3 id="can-dvc-studio-be-connected-to-a-self-managed-gitlab-repo" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/841856466897469441/907468264882462800" target="_blank" rel="nofollow noopener noreferrer">Can DVC Studio be connected to a self-managed GitLab repo?</a><a href="#can-dvc-studio-be-connected-to-a-self-managed-gitlab-repo" aria-label="can dvc studio be connected to a self managed gitlab repo permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Very good question about Studio @Sra!</p>
<p>Right now this only works if it's an on-premises network or a private VPC
network.</p>
<p>We are working on bringing custom-domain GitLab as a feature very soon! You can
follow
<a href="https://github.com/iterative/studio-support/issues/12" target="_blank" rel="nofollow noopener noreferrer">this GitHub issue</a> and
leave comments for anything you'd like to see!</p>
<h3 id="is-there-a-way-to-extend-default-job-execution-time-for-a-cml-runner" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/904660123161600021" target="_blank" rel="nofollow noopener noreferrer">Is there a way to extend default job execution time for a CML runner?</a><a href="#is-there-a-way-to-extend-default-job-execution-time-for-a-cml-runner" aria-label="is there a way to extend default job execution time for a cml runner permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There is definitely a way to do this!</p>
<p>You can extend the max time in your CI by adding something like this:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train</span><span class="token punctuation">:</span>
  <span class="token key atrule">timeout-minutes</span><span class="token punctuation">:</span> <span class="token number">5000</span></code></pre></div>
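<p>For context, here is a hedged sketch of where that setting could sit in a complete GitHub Actions workflow file. The file name, workflow name, trigger, <code>runs-on</code> labels, and the <code>python train.py</code> step are assumptions made for illustration; only the <code>train</code> job and its <code>timeout-minutes</code> value come from the snippet above.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"># .github/workflows/train.yaml (hypothetical file name)
name: train-model            # hypothetical workflow name
on: [push]                   # assumed trigger
jobs:
  train:
    runs-on: [self-hosted]   # assumption: a self-hosted runner picks up the job
    timeout-minutes: 5000    # job-level limit, in minutes
    steps:
      - uses: actions/checkout@v3
      - run: python train.py # hypothetical training command
</code></pre></div>
<p>The key sits at the job level, next to <code>runs-on</code>, so it applies to the whole <code>train</code> job and not to a single step.</p>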
<p>If you're using GitLab, the same update would look similar to this:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train</span><span class="token punctuation">:</span>
  <span class="token key atrule">timeout</span><span class="token punctuation">:</span> 72 hours</code></pre></div>
<p>Thanks for this question @evergreengt!</p>
<hr>
<p><img src="https://media.giphy.com/media/VInc9GYelUbHf5QhNR/giphy.gif" alt="Matt Fraser GIF by E!"></p>
<p>At our December Office Hours Meetup we will be doing a new feature demo you
won't want to miss!
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282064369/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a>
to stay up to date with specifics as we get closer to the event!</p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and
CML questions answered!</p>https://dvc.org/blog/november-21-heartbeathttps://dvc.org/blog/november-21-heartbeatWed, 17 Nov 2021 00:00:00 GMT<h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>I can't believe it's already November! Our Community has given us a lot to be
thankful for!</p>
<p><img src="https://media.giphy.com/media/vLxTOSEfHIr0A/giphy.gif" alt="Hello November!"></p>
<h2 id="thanakorn-panyapiangs-two-part-tutorial-data-versioning-with-dvc" style="position:relative;">Thanakorn Panyapiang's Two Part tutorial: Data Versioning with DVC<a href="#thanakorn-panyapiangs-two-part-tutorial-data-versioning-with-dvc" aria-label="thanakorn panyapiangs two part tutorial data versioning with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In his two-part tutorial, which can be found
<a href="https://medium.com/@thanakornpanyapiang/data-versioning-why-do-data-science-projects-need-it-a44cb7a471c9" target="_blank" rel="nofollow noopener noreferrer">here</a>
and
<a href="https://medium.com/@thanakornpanyapiang/data-versioning-with-dvc-a474af1247f5" target="_blank" rel="nofollow noopener noreferrer">here,</a>
<a href="https://www.linkedin.com/in/tpanyapiang/" target="_blank" rel="nofollow noopener noreferrer"><strong>Thanakorn Panyapiang</strong></a> first
explains why data versioning is so important to successful machine learning
projects. Next, he walks us through installing and initializing DVC. Finally, he
covers tracking data, pushing it to remote storage, and modifying and switching
between versions of the data. Look out for future posts from Thanakorn on the
other features of DVC, including pipelines, metrics, experiments, and continuous
integration through CML!</p>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/@thanakornpanyapiang/data-versioning-with-dvc-a474af1247f5/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Data Versioning with DVC</h4>
<div class="elp-description">Thanakorn Panyapiang's explanation of the importance of data version control in ML projects and tutorial on DVC.</div>
<div class="elp-link">https://medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-11-17/panyapiang-6809a26f4e50374dbdb69b06c83d1cf5.jpeg" alt="Data Versioning with DVC">
</div>
</a>
</section>
<p></p>
<h2 id="sanaka-chathuranga-end-to-end-machine-learning-pipeline-with-mlops-tools" style="position:relative;">Shanaka Chathuranga: End to End Machine Learning Pipeline with MLOps tools<a href="#sanaka-chathuranga-end-to-end-machine-learning-pipeline-with-mlops-tools" aria-label="sanaka chathuranga end to end machine learning pipeline with mlops tools permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/shanakac/" target="_blank" rel="nofollow noopener noreferrer"><strong>Shanaka Chathuranga</strong></a> uses multiple
tools, including DVC, to build an end-to-end machine learning pipeline. In the mix
you'll find Cookiecutter, DVC, MLflow, GitHub Actions, Heroku, Flask, Evidently
AI, and PyTest in
<a href="https://medium.com/@shanakachathuranga/end-to-end-machine-learning-pipeline-with-mlops-tools-mlflow-dvc-flask-heroku-evidentlyai-github-c38b5233778c" target="_blank" rel="nofollow noopener noreferrer">his post</a>
on <a href="https://medium.com/" target="_blank" rel="nofollow noopener noreferrer">Medium.</a> DVC is used for data versioning and model
pipeline management in this tutorial.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/799a0ab79de5777ed1465c8eb8404a2e/39600/shanaka.png" alt="End to End Machine Learning Pipeline" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Shanaka Chathuranga's End to End ML Pipeline Tools Stack
(<a href="https://medium.com/@shanakachathuranga/end-to-end-machine-learning-pipeline-with-mlops-tools-mlflow-dvc-flask-heroku-evidentlyai-github-c38b5233778c" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<p>📣 Swag to the first person to do a similar tutorial using DVC for experiment
tracking and versioning and CML for CI/CD. 🚦Go!👉🏽</p>
<h2 id="covid-genomics-apache-airflow-and-dvc-integration" style="position:relative;">COVID Genomics Apache Airflow and DVC Integration<a href="#covid-genomics-apache-airflow-and-dvc-integration" aria-label="covid genomics apache airflow and dvc integration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://covidgenomics.com/blog/airflow_dvc/" target="_blank" rel="nofollow noopener noreferrer">In this blog post,</a>
<a href="https://www.linkedin.com/in/piotrstyczynski/" target="_blank" rel="nofollow noopener noreferrer"><strong>Piotr Styczyński</strong></a> of
<a href="https://covidgenomics.com/" target="_blank" rel="nofollow noopener noreferrer">COVID Genomics</a> shares how they use Airflow and DVC
together in their work to model SARS-CoV-2 and optimize RT-PCR tests. They
needed to update the data used to train the model daily and automate their
processes so that the whole pipeline stays up to date.</p>
<p>Be sure to check out the very detailed tutorial with lots of delicious code and
two repositories <a href="https://github.com/covid-genomics/airflow-dvc" target="_blank" rel="nofollow noopener noreferrer">here</a> and
<a href="https://github.com/covid-genomics/dvc-fs" target="_blank" rel="nofollow noopener noreferrer">here.</a></p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 570px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/32dbfd16f40c957d08f13e231577f71a/39600/covid-genomics.png" alt="Airflow + DVC Integration" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Piotr Styczyński's blog on COVID Genomics use of Airflow with DVC
(<a href="https://covidgenomics.com/blog/airflow_dvc/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="looking-to-create-a-light-weight-feature-store" style="position:relative;">Looking to create a light weight Feature Store?<a href="#looking-to-create-a-light-weight-feature-store" aria-label="looking to create a light weight feature store permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Remember <a href="https://twitter.com/jcpsantiago" target="_blank" rel="nofollow noopener noreferrer"><strong>João Santiago</strong></a> from
<a href="https://github.com/jcpsantiago/dvthis" target="_blank" rel="nofollow noopener noreferrer">dvthis?</a> Well, he's back at it, solving ML
engineering challenges and sharing his new blog post,
<a href="https://medium.com/billie-finanzratgeber/unlocking-our-data-with-a-feature-store-402ade0743b" target="_blank" rel="nofollow noopener noreferrer">Unlocking Our Data with a Feature Store.</a>
In this article from the <a href="https://www.billie.io/" target="_blank" rel="nofollow noopener noreferrer">Billie.io</a> engineering crew,
Santiago shows how they implemented a lightweight feature store, creating a
system in which features are defined in YAML files (gotta love those YAML files
😉) interfacing with Snowflake. Check out how they did it, and learn the term
"instarejected" which he coined and we all should instaadopt!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 509px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bce00f72f761efe5c1b6924c7e398c42/39600/billie.png" alt="Billie.io Lightweight Feature Store" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Billie.io's feature store: Snowflake + Lambda + Redis
(<a href="https://medium.com/billie-finanzratgeber/unlocking-our-data-with-a-feature-store-402ade0743b" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h1 id="learning-opportunities" style="position:relative;">Learning Opportunities<a href="#learning-opportunities" aria-label="learning opportunities permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="learn-about-dvc-en-español" style="position:relative;">Learn about DVC en Español!<a href="#learn-about-dvc-en-espa%C3%B1ol" aria-label="learn about dvc en español permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://tryolabs.com/" target="_blank" rel="nofollow noopener noreferrer">TryoLabs</a> recently held an Open Meetup in Uruguay
about some of the technology they use at the consultancy.
<a href="https://www.linkedin.com/in/ianspektor/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ian Spektor</strong>,</a>
<a href="https://www.linkedin.com/in/diego-kiedanski/" target="_blank" rel="nofollow noopener noreferrer"><strong>Diego Kiedanski</strong>,</a> and
<a href="https://www.linkedin.com/in/nicol%C3%A1s-eiris-64916194/" target="_blank" rel="nofollow noopener noreferrer"><strong>Nicolás Eiris</strong></a>
presented what they have learned from using DVC to better organize the data for
the various projects they work on with their clients. In addition to
streamlining the onboarding of data for their projects, DVC has given them
reproducibility across the various data and code versions in their workflows.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/4uEjIa-f_FE?rel=0&%3B=&%3Bshowinfo=0%3B&start=268" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>Also en Español, our own
<a href="https://twitter.com/daviddelachurch" target="_blank" rel="nofollow noopener noreferrer"><strong>David de la Iglesia Castro</strong></a> will be
presenting at
<a href="https://pybcn.org/events/pyday_bcn/pyday_bcn_2021/" target="_blank" rel="nofollow noopener noreferrer">Python Barcelona</a> on
"Making MLOps Uncool Again." In this workshop David will show you how to use
HuggingFace, DVC and CML to create an MLOps workflow, extending the power of Git
and GitHub without the need for external platforms or complicated
infrastructure.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://pybcn.org/events/pyday_bcn/pyday_bcn_2021/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Python Barcelona</h4>
<div class="elp-description">Join David de la Iglesia Castro for his workshop entitled Making MLOps Uncool Again.</div>
<div class="elp-link">https://pybcn.org</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-11-17/py-barcelona-8da4ea0fb572ed2e8ee49c791bc3d1b7.png" alt="Python Barcelona">
</div>
</a>
</section>
<p></p>
<h2 id="october-office-hours-video-continuum-industries-tool-stack-with-ivan-chan" style="position:relative;">October Office Hours Video: Continuum Industries Tool Stack with Ivan Chan<a href="#october-office-hours-video-continuum-industries-tool-stack-with-ivan-chan" aria-label="october office hours video continuum industries tool stack with ivan chan permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>If you missed last month's Office Hours
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/" target="_blank" rel="nofollow noopener noreferrer">Meetup</a>, you can now
catch the video! <a href="https://www.linkedin.com/in/ivanchc/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ivan Chan</strong></a> took us
on a journey through the
<a href="https://www.continuum.industries/" target="_blank" rel="nofollow noopener noreferrer">Continuum Industries</a> tool stack and showed
us how they save tons of time weekly by integrating DVC and CML into their
workflows.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/TBZKfyYWtXs?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="are-you-a-data-scientist-struggling-with-some-of-the-ml-engineering-concepts" style="position:relative;">Are you a Data Scientist Struggling with some of the ML engineering concepts?<a href="#are-you-a-data-scientist-struggling-with-some-of-the-ml-engineering-concepts" aria-label="are you a data scientist struggling with some of the ml engineering concepts permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="atinuke-oluwabamikemi-kayode-common-github-terms-for-open-source-contributors" style="position:relative;">Atinuke Oluwabamikemi Kayode: Common Github Terms for Open Source Contributors<a href="#atinuke-oluwabamikemi-kayode-common-github-terms-for-open-source-contributors" aria-label="atinuke oluwabamikemi kayode common github terms for open source contributors permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>For the learners out there,
<a href="https://twitter.com/oluwabamikemi" target="_blank" rel="nofollow noopener noreferrer"><strong>Atinuke Oluwabamikemi Kayode's</strong></a> piece
<a href="https://iambami.dev/common-github-terms-for-open-source-contributors-ckvuhdzsf0jcocms1fggb0fj3" target="_blank" rel="nofollow noopener noreferrer">Common Github Terms for Open Source Contributors</a>
covers the most common terminology you need to know when using GitHub
in your projects. Need to understand what "checkout" is? The difference between
"origin" and "master?" Atinuke has you covered in this piece.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://iambami.dev/common-github-terms-for-open-source-contributors-ckvuhdzsf0jcocms1fggb0fj3" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Common GitHub Terms for Open Source Contributors</h4>
<div class="elp-description">Atinuke Oluwabamikemi Kayode helps you navigate the common terminology in GitHub.</div>
<div class="elp-link">https://iambami.dev</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-11-17/kayode-e1b389ce00e5cf7d58417b6598467526.jpeg" alt="Common GitHub Terms for Open Source Contributors">
</div>
</a>
</section>
<p></p>
<h3 id="vincent-driessen-a-successful-git-branching-architecture" style="position:relative;">Vincent Driessen: A Successful Git Branching Architecture<a href="#vincent-driessen-a-successful-git-branching-architecture" aria-label="vincent driessen a successful git branching architecture permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>For a deeper dive into how Git and versioning work, check out the
<a href="https://nvie.com/posts/a-successful-git-branching-model/" target="_blank" rel="nofollow noopener noreferrer">A Successful Git Branching Model</a>
piece by <a href="https://twitter.com/nvie" target="_blank" rel="nofollow noopener noreferrer"><strong>Vincent Driessen</strong></a>, which explains the
Git branching model in detail. While the explanation is framed around
software development, it will help you understand how Git versioning works. This
foundation offers insight into how DVC works, delivering the same
capabilities for data, models, and experimentation.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 575px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/5cda182ace005d6b228eae1f2de4cf92/39600/git-model.png" alt="Git Versioning in Software Development" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Vincent Driessen's Git Branching Model
(<a href="https://nvie.com/posts/a-successful-git-branching-model/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h3 id="nir-barazida-notebook-to-production" style="position:relative;">Nir Barazida: Notebook to Production<a href="#nir-barazida-notebook-to-production" aria-label="nir barazida notebook to production permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://twitter.com/barazida" target="_blank" rel="nofollow noopener noreferrer"><strong>Nir Barazida</strong></a> of
<a href="https://dagshub.com/" target="_blank" rel="nofollow noopener noreferrer">DAGsHub</a> brings us a blog post on
<a href="https://dagshub.com/blog/notebook-to-production-ready-machine-learning/" target="_blank" rel="nofollow noopener noreferrer">Notebook to Production</a>
which explains why you should, and how you can, move your code from notebooks to
scripts when working on production-ready ML projects. You'll see how DVC is used
to version everything in the process, so your team will always know which version
of each element that goes into your project produced, or failed to produce, the
best results.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://dagshub.com/blog/notebook-to-production-ready-machine-learning/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Notebook to Production</h4>
<div class="elp-description">Nir Barazida shows you why and how to bring your notebook to production ready code.</div>
<div class="elp-link">https://dagshub.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-11-17/dagshub-dvc-9c3e49c8b98691eb79706d384438303d.png" alt="Notebook to Production">
</div>
</a>
</section>
<p></p>
<h2 id="dvc-online-course-update" style="position:relative;">DVC Online Course Update!<a href="#dvc-online-course-update" aria-label="dvc online course update permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We know you've wanted it, and the day is getting closer and closer! By the end
of this week we will be about 90% done recording videos for the first course,
and then it's on to video processing and platform set up. The first course will
focus on DVC for Data Scientists and Analysts. You can expect to see the course
out by the end of the year. The course will be 100% <strong>FREE</strong> and available from
our website. We are so excited about how it's coming to life! 🚀</p>
<p><img src="https://media.giphy.com/media/hL9q5k9dk9l0wGd4e0/giphy.gif" alt="Loading Downloading GIF by Vera Verreschi"></p>
<h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="san-francisco-off-site" style="position:relative;">San Francisco Off-site<a href="#san-francisco-off-site" aria-label="san francisco off site permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The group of us from the Americas met in San Francisco last week. We had a great
time getting to know each other better and working on ways and processes to make
our tools even better for you! Amidst our working, we also took time out to
visit Alcatraz, go on a scavenger hunt, and eat some great food! Pictured below
from left front, going clockwise: Jorge Orpinel, Stephanie Roy, Ivan Shcheklein,
Dmitry Petrov, Dave Berenbaum, Jervis Hui, Ken Thom, Jon Burdo, Peter Rowlands,
Julie Galvan, Jeny De Figueiredo, Jordan Weber, and Maria Khalusova! 🎉</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/77dbbc25c91aa6f4b5b96ed0dd55c408/03346/team.jpg" alt="America Team Members meet in San Francisco" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Iterative Team Members meet in San Francisco
(<a href="https://www.linkedin.com/in/jorgeorpinel/" target="_blank" rel="nofollow noopener noreferrer">Source: Jorge Orpinel</a>)</em></p>
<h2 id="new-team-members" style="position:relative;">New Team Members<a href="#new-team-members" aria-label="new team members permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/maria-khalusova-a958aa14/" target="_blank" rel="nofollow noopener noreferrer"><strong>Maria Khalusova</strong></a>
joins us from Montreal, Canada as a Senior Developer Advocate. Previously at
JetBrains for 14 years, Maria brings a ton of experience in developer advocacy and
product management. She has already dived into working on CML and the upcoming
releases. She also organizes PyData Montreal. In her free time, Maria likes to
spend time with her two kids, walk their mixed bulldog, and garden. 👩🏻🌾 Welcome
Maria!</p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>As always, we're still hiring!
<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a>
to find details of all the positions including:</p>
<ul>
<li>Senior Software Engineer (ML, Labeling, Python)</li>
<li>Senior Software Engineer (ML, DevTools, Python)</li>
<li>Field Data Scientist / Sales Engineer</li>
<li>Developer Advocate (ML)</li>
<li>Director / VP of Engineering (ML, DevTools)</li>
<li>Director / VP of Product (ML, Data Infra, SaaS)</li>
<li>Head of Talent</li>
<li>Head of DevRel</li>
</ul>
<p>Please pass this info on to anyone you know who may fit the bill. We look
forward to new team members! 🎉</p>
<p><img src="https://media.giphy.com/media/ZcQXsVrAuKMePTJYG6/giphy.gif" alt="Hyper RPG GIF"></p>
<h2 id="docs-updates" style="position:relative;">Docs Updates<a href="#docs-updates" aria-label="docs updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This month's important doc updates come from CML! The CML team has been on fire
🔥 building new things. You will want to keep an eye on
<a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML.dev</a> and our social media channels for big news before
the end of the year!</p>
<h3 id="-cml-self-hosted-runners" style="position:relative;">📖 CML: Self-hosted Runners<a href="#-cml-self-hosted-runners" aria-label=" cml self hosted runners permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Check out the new
<a href="https://cml.dev/doc/self-hosted-runners?tab=GitLab#allocating-cloud-compute-resources-with-cml" target="_blank" rel="nofollow noopener noreferrer">Self-hosted Runners</a>
doc. This will help you set up your own runners and allocate cloud computing
resources. Whether you are a GitHub or GitLab user, you will be able to toggle
between the respective code needed right there at your fingertips!</p>
<h3 id="-cml-command-reference-send-comment" style="position:relative;">📖 CML: Command Reference: <code>send-comment</code><a href="#-cml-command-reference-send-comment" aria-label=" cml command reference send comment permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The new
<a href="https://cml.dev/doc/ref/send-comment#command-reference-send-comment" target="_blank" rel="nofollow noopener noreferrer">Command Reference: send-comment</a>
doc explains how to post a Markdown comment on a commit. It also covers the flags for
associating the comment with another pull/merge request, or for handling the case where
<code>cml pr</code> was used earlier in your workflow.</p>
<h3 id="-branding-assets" style="position:relative;">📖 Branding Assets<a href="#-branding-assets" aria-label=" branding assets permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If you are interested in writing a blog post about our tools, we now have a very
easy way for you to get your hands on our logos as well as a guide to let you
know how and where it's appropriate to use our logos and images. We love when
the Community shares about our tools!<br>
<a href="https://iterative.ai/brand" target="_blank" rel="nofollow noopener noreferrer">Find our branding assets here.</a></p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 468px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1b99c851fb73a6233fc6f05e59984d90/39600/brand.png" alt="Iterative.AI Branding Assets" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Iterative.AI branded assets for your next blog post 😉
(<a href="https://iterative.ai/brand" target="_blank" rel="nofollow noopener noreferrer">Source:</a>)</em></p>
<h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Be sure to join us at the
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282064369/" target="_blank" rel="nofollow noopener noreferrer">December Office Hours Meetup,</a>
where we will be showing a demo on a new feature! We can't say more just yet 🤐,
but be sure to RSVP!</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282064369/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">DVC Office Hours - New Feature Release</h4>
<div class="elp-description">Join us at the December Office Hours for a demo of a new feature in DVC!</div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-10-15/office-hours-meetup-409c5ab48d208e9a9cdc6871fd4c0937.png" alt="DVC Office Hours - New Feature Release">
</div>
</a>
</section>
<p></p>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Last but never least, I leave you with this great tweet from Paige Bailey, this
time about CML's docs:</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">🦉<a href="https://twitter.com/DVCorg">@DVCorg</a>'s docs are *shiny*—especially the sample code for generating reports, using either <a href="https://twitter.com/github">@GitHub</a> or <a href="https://twitter.com/gitlab">@GitLab</a>.<a href="https://t.co/PKPS923HUR">https://t.co/PKPS923HUR</a><br><br>All you have to do to auto-generate a report with metrics and plots, is include the YAML file in a .github/workflows folder in your repo. <a href="https://t.co/WTSZYcLjwI">pic.twitter.com/WTSZYcLjwI</a></p>— 👩💻 Paige Bailey (@DynamicWebPaige) <a href="https://twitter.com/DynamicWebPaige/status/1459395186027470849">November 13, 2021</a></blockquote>
<hr>
<p><em>Have something great to say about our tools? We'd love to hear it! Head to
<a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a>
to record or write a Testimonial! Join our
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/october-21-community-gemshttps://dvc.org/blog/october-21-community-gemsThu, 28 Oct 2021 00:00:00 GMT<h3 id="is-there-a-command-to-force-reproduce-a-specific-stage-of-a-dvc-pipeline" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/893056918699008000" target="_blank" rel="nofollow noopener noreferrer">Is there a command to force reproduce a specific stage of a DVC pipeline?</a><a href="#is-there-a-command-to-force-reproduce-a-specific-stage-of-a-dvc-pipeline" aria-label="is there a command to force reproduce a specific stage of a dvc pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Good question @wickeat!</p>
<p>You can use <a href="https://dvc.org/doc/command-reference/repro#-f"><code>dvc repro -f &lt;stage_name&gt;</code></a>, although this will also reproduce the
earlier stages in the pipeline that it depends on. If you only want to
reproduce a single target stage, you can add <code>-s/--single-item</code> to the
<a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> command.</p>
<h3 id="how-do-you-manage-a-dvcyaml-file-for-a-project-thats-going-to-be-a-big-sparse-dag" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/893487527749623859" target="_blank" rel="nofollow noopener noreferrer">How do you manage a <code>dvc.yaml</code> file for a project that's going to be a big, sparse DAG?</a><a href="#how-do-you-manage-a-dvcyaml-file-for-a-project-thats-going-to-be-a-big-sparse-dag" aria-label="how do you manage a dvcyaml file for a project thats going to be a big sparse dag permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is an awesome use case from @Ian!</p>
<p>Let's say we have this scenario:</p>
<ul>
<li>A new data set is delivered to you every day</li>
<li>It needs to be featurized (does not depend on previous days' data)</li>
<li>Subsequent stage depends on all days</li>
</ul>
<p>The recommended approach is to keep all of the previous days and use the
<code>foreach</code> syntax, which ensures your DAG still knows about all the previously
processed days:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">featurize</span><span class="token punctuation">:</span>
    <span class="token key atrule">foreach</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token number">20210101</span>
      <span class="token punctuation">-</span> <span class="token number">20210102</span>
      <span class="token punctuation">-</span> <span class="token number">20210103</span>
    <span class="token key atrule">do</span><span class="token punctuation">:</span>
      <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python featurize.py $<span class="token punctuation">{</span>item<span class="token punctuation">}</span>
      <span class="token key atrule">deps</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> raw/$<span class="token punctuation">{</span>item<span class="token punctuation">}</span>.csv
      <span class="token key atrule">outs</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> intermediate/$<span class="token punctuation">{</span>item<span class="token punctuation">}</span>.csv
  <span class="token key atrule">combine</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python combine.py
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> intermediate
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> combined.csv</code></pre></div>
<p>That way if you adjusted something in your featurize script, for example, it
would automatically reprocess every day's data.</p>
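<p>Since a new day arrives every day, you may not want to maintain the <code>foreach</code> list by hand. Here is a minimal sketch, assuming the raw files live in <code>raw/</code> and are named after their day stamp as in the example above. The helper names <code>foreach_items</code> and <code>render_foreach</code> are hypothetical and not part of DVC; they only print list entries that you could paste into <code>dvc.yaml</code>.</p>

```python
from pathlib import Path
from typing import List


def foreach_items(raw_dir: str) -> List[str]:
    """Return the sorted day stamps (file stems) of the CSV files in raw_dir."""
    return sorted(p.stem for p in Path(raw_dir).glob("*.csv"))


def render_foreach(items: List[str], indent: int = 6) -> str:
    """Format the items as YAML list entries for the `foreach` key."""
    pad = " " * indent
    return "\n".join(f"{pad}- {item}" for item in items)


if __name__ == "__main__":
    # Prints one "- <day>" line per CSV file found in raw/ (assumed layout).
    print(render_foreach(foreach_items("raw")))
```

<p>Sorting the stems keeps the list stable between runs, so adding a new day only appends one entry.</p>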
<h3 id="what-is-the-best-practice-for-capturing-and-saving-stdout" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/893903023355613214" target="_blank" rel="nofollow noopener noreferrer">What is the best practice for capturing and saving <code>stdout</code>?</a><a href="#what-is-the-best-practice-for-capturing-and-saving-stdout" aria-label="what is the best practice for capturing and saving stdout permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The best practice when using DVC is to pipe each command's <code>stdout</code> into a
different file with a unique name, such as one that includes a timestamp, in a
directory that becomes the stage output.</p>
<p>This is also what we recommend if optimizing storage space is a concern, in
case the <code>stdout</code> dumps grow a lot.</p>
<p>Here's an example of what that might look like if you're using a tool like
<code>tee</code>.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train</span><span class="token punctuation">:</span>
  <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python src/train.py data/features model.pkl <span class="token punctuation">|</span> tee <span class="token punctuation">-</span>a 20211021_model.pkl
  <span class="token key atrule">deps</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> data/features
    <span class="token punctuation">-</span> src/train.py
  <span class="token key atrule">params</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> train.min_split
    <span class="token punctuation">-</span> train.n_est
    <span class="token punctuation">-</span> train.seed
  <span class="token key atrule">outs</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> 20211021_model.pkl</code></pre></div>
<p>This will output the <code>stdout</code> from the train stage in the terminal and also save
it in a file with the timestamp as part of its name.</p>
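<p>If you would rather not hard-code the date, the same idea can be sketched in Python. This is only an illustration under assumptions: the helper names <code>log_path</code> and <code>run_and_tee</code>, the <code>logs</code> directory, and the <code>.log</code> suffix are all hypothetical, and the directory you choose would still need to be declared as the stage output.</p>

```python
import subprocess
import sys
from datetime import datetime
from pathlib import Path


def log_path(log_dir: str, stage: str, now: datetime) -> Path:
    """Build a unique, timestamped log file name inside log_dir."""
    return Path(log_dir) / f"{now:%Y%m%d_%H%M%S}_{stage}.log"


def run_and_tee(cmd: list, path: Path) -> int:
    """Run cmd, echoing its stdout to the terminal and saving it to path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as log, subprocess.Popen(
        cmd, stdout=subprocess.PIPE, text=True
    ) as proc:
        for line in proc.stdout:
            sys.stdout.write(line)  # what `tee` shows in the terminal
            log.write(line)  # what `tee` writes to the file
    return proc.returncode


if __name__ == "__main__":
    # Hypothetical usage: wrap the training command from the stage above.
    target = log_path("logs", "train", datetime.now())
    command = [sys.executable, "src/train.py", "data/features", "model.pkl"]
    sys.exit(run_and_tee(command, target))
```

<p>Because every run gets its own file name, earlier logs are never overwritten.</p>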
<p>That was a helpful question. Thanks @gregk0!</p>
<h3 id="there-is-a-file-in-our-pipeline-that-needs-to-be-manually-modified-and-then-used-as-the-input-to-other-stages-in-the-pipeline-what-would-be-the-best-approach-for-this-with-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/894577842363445308" target="_blank" rel="nofollow noopener noreferrer">There is a file in our pipeline that needs to be manually modified and then used as the input to other stages in the pipeline. What would be the best approach for this with DVC?</a><a href="#there-is-a-file-in-our-pipeline-that-needs-to-be-manually-modified-and-then-used-as-the-input-to-other-stages-in-the-pipeline-what-would-be-the-best-approach-for-this-with-dvc" aria-label="there is a file in our pipeline that needs to be manually modified and then used as the input to other stages in the pipeline what would be the best approach for this with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is another great use case. Thanks @omarelb!</p>
<p>Let's say that you have a process similar to this.</p>
<ul>
<li>Run the first stage of the pipeline, for example a stage called <code>cleaning</code></li>
<li>Inspect its output, <code>lexicon.txt</code>, and modify it if necessary</li>
<li>The modified version of <code>lexicon.txt</code> is then cached and used as input to
following stages of the pipeline</li>
</ul>
<p>You can copy the output to a new location, then modify and commit the copy
there, so that the first stage and its output stay separate from the modified
file and the subsequent stages.</p>
<p>If you want to link the first stage to the rest of the pipeline, you could have
your 2nd stage be something like:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">manual</span><span class="token punctuation">:</span>
  <span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
    # To generate lexicon_modified.txt:
    # 1. Run `cp lexicon.txt lexicon_modified.txt`.
    # 2. Check and modify lexicon_modified.txt.
    # 3. Run `dvc commit manual`.</span>
  <span class="token key atrule">deps</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> lexicon.txt
  <span class="token key atrule">outs</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> lexicon_modified.txt</code></pre></div>
<p>To clarify, if you put that <code>manual</code> stage into your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, it should
connect the whole pipeline. Each time you run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> and the first stage
generates a new <code>lexicon.txt</code>, you will get
<code>ERROR: failed to reproduce 'dvc.yaml': output 'lexicon_modified.txt' does not exist</code>
because the manual stage doesn't generate the expected output.</p>
<p>You can then manually copy, modify, and commit your new <code>lexicon_modified.txt</code>
and run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> again to run the rest of the pipeline.</p>
<h3 id="what-is-the-workflow-if-i-want-to-remove-some-files-from-my-dataset-registry-with-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/895192983366942740" target="_blank" rel="nofollow noopener noreferrer">What is the workflow if I want to remove some files from my dataset registry with DVC?</a><a href="#what-is-the-workflow-if-i-want-to-remove-some-files-from-my-dataset-registry-with-dvc" aria-label="what is the workflow if i want to remove some files from my dataset registry with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In this case, assume that the data was added as a folder containing images,
which means that there is a single <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file for the whole folder. You don't need
to remove the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file that's tracking the data in that folder.</p>
<p>You can delete the files you want to remove and then re-add the folder using
<a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a>. Here's an example of what that flow might look like.</p>
<ul>
<li>You <code>git clone</code> your data registry.</li>
<li>Then <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> your data.</li>
<li>Delete the files you want to remove.</li>
<li>Run <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> and <code>git commit</code> to save your changes.</li>
</ul>
<p>It should be faster to commit, as DVC won't re-add the files to the cache nor
will it try to hash them.</p>
<p>Good question @MadsO!</p>
<h3 id="we-want-to-access-a-private-git-repo-using-dvcapiread-in-a-docker-container-how-do-i-pass-the-credentials-to-dvc-so-that-we-can-read-dvc-files-from-this-repo" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/894533078389784577" target="_blank" rel="nofollow noopener noreferrer">We want to access a private Git repo using <code>dvc.api.read()</code> in a Docker container. How do I pass the credentials to DVC so that we can read DVC files from this repo?</a><a href="#we-want-to-access-a-private-git-repo-using-dvcapiread-in-a-docker-container-how-do-i-pass-the-credentials-to-dvc-so-that-we-can-read-dvc-files-from-this-repo" aria-label="we want to access a private git repo using dvcapiread in a docker container how do i pass the credentials to dvc so that we can read dvc files from this repo permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Great question about the API @dashmote!</p>
<p>There are a couple of different ways to handle this.</p>
<p>The first option is to use SSH. You'll need to pass GitHub SSH keys into your
Docker container and use the <code>git@github.com:username/repo.git</code> URL format when
you call the API method.</p>
<p>The other option is to use HTTP. You need to use the
<code>https://username:token@github.com/username/repo.git</code> URL format when you call
the API method.</p>
<p>You could pass your credentials into your container as environment variables and
then do something like:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> os

<span class="token keyword">import</span> dvc<span class="token punctuation">.</span>api

username <span class="token operator">=</span> os<span class="token punctuation">.</span>environ<span class="token punctuation">[</span><span class="token string">"GITHUB_USERNAME"</span><span class="token punctuation">]</span>
token <span class="token operator">=</span> os<span class="token punctuation">.</span>environ<span class="token punctuation">[</span><span class="token string">"GITHUB_TOKEN"</span><span class="token punctuation">]</span>
dvc<span class="token punctuation">.</span>api<span class="token punctuation">.</span>read<span class="token punctuation">(</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">,</span> repo<span class="token operator">=</span><span class="token string-interpolation"><span class="token string">f"https://</span><span class="token interpolation"><span class="token punctuation">{</span>username<span class="token punctuation">}</span></span><span class="token string">:</span><span class="token interpolation"><span class="token punctuation">{</span>token<span class="token punctuation">}</span></span><span class="token string">@github.com/..."</span></span><span class="token punctuation">,</span> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">)</span></code></pre></div>
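<p>If your credentials might contain characters such as <code>@</code> or <code>:</code>, it can help to percent-encode them before building the URL. Below is a minimal sketch of the HTTP option, assuming the repo is hosted on GitHub. The helper name <code>github_repo_url</code> and its parameters are hypothetical and not part of the DVC API.</p>

```python
from urllib.parse import quote


def github_repo_url(username: str, token: str, repo_path: str) -> str:
    """Build an HTTPS Git URL with the credentials embedded in it.

    repo_path is the "owner/name" part of the repository address.
    """
    # Percent-encode the credentials so "@", ":" or "/" cannot break the URL.
    user = quote(username, safe="")
    secret = quote(token, safe="")
    return f"https://{user}:{secret}@github.com/{repo_path}.git"
```

<p>You could then read <code>GITHUB_USERNAME</code> and <code>GITHUB_TOKEN</code> from <code>os.environ</code> as shown above and pass the returned string as the <code>repo</code> argument of <code>dvc.api.read()</code>.</p>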
<h3 id="is-there-a-clean-way-to-handle-multiple-models-in-the-same-repo-that-are-trained-using-the-same-pipeline" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/895368479853649930" target="_blank" rel="nofollow noopener noreferrer">Is there a clean way to handle multiple models in the same repo that are trained using the same pipeline?</a><a href="#is-there-a-clean-way-to-handle-multiple-models-in-the-same-repo-that-are-trained-using-the-same-pipeline" aria-label="is there a clean way to handle multiple models in the same repo that are trained using the same pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Let's say your project looks something like this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">├── data
│   ├── customer_1
│   │   ├── input_data.txt
│   │   ├── input_params.yaml
│   │   └── output
│   │       └── model.pkl
│   └── customer_2
│       ├── input_data.txt
│       ├── input_params.yaml
│       └── output
│           └── model.pkl
├── dvc.lock
├── dvc.yaml
└── train_model.py</code></pre></div>
<p>The simplest way is to copy the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> into each model's separate directory,
like this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">├── data
│   ├── customer_1
│   │   ├── input_data.txt
│   │   ├── input_params.yaml
│   │   ├── dvc.yaml
│   │   ├── dvc.lock
│   │   └── output
│   │       └── model.pkl
│   └── customer_2
│       ├── input_data.txt
│       ├── input_params.yaml
│       ├── dvc.yaml
│       ├── dvc.lock
│       └── output
│           └── model.pkl
└── train_model.py</code></pre></div>
<p>Another potential solution is to try templating. We'll have a <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> in the
root of the project and add <code>vars</code> to define the model you want to train. Then
we'll update the <code>train</code> stage to use the <code>vars</code> like this:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">vars</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token key atrule">model_name</span><span class="token punctuation">:</span> <span class="token string">'customer_2'</span>
<span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">train</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train_model.py
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> data/$<span class="token punctuation">{</span>model_name<span class="token punctuation">}</span>/input_data.txt
    <span class="token key atrule">params</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> data/$<span class="token punctuation">{</span>model_name<span class="token punctuation">}</span>/input_params.yaml<span class="token punctuation">:</span>
          <span class="token punctuation">-</span> batch_size
          <span class="token punctuation">-</span> <span class="token punctuation">...</span></code></pre></div>
<p>You can
<a href="https://dvc.org/doc/user-guide/project-structure/pipelines-files#templating" target="_blank" rel="nofollow noopener noreferrer">learn more about templating in the docs</a>.
It essentially lets you add variables to the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> to dynamically set
values for your stages.</p>
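<p>To get a feel for what the substitution produces, here is a small sketch that uses Python's standard <code>string.Template</code>, which happens to share the <code>${name}</code> placeholder syntax. This is only an illustration: it is not how DVC implements templating, and the helper name <code>resolve</code> is hypothetical.</p>

```python
from string import Template


def resolve(value: str, variables: dict) -> str:
    """Substitute ${name} placeholders in a single stage field."""
    return Template(value).substitute(variables)


if __name__ == "__main__":
    deps = ["data/${model_name}/input_data.txt"]
    # Show the dependency path each model would resolve to.
    for model_name in ("customer_1", "customer_2"):
        print([resolve(dep, {"model_name": model_name}) for dep in deps])
```

<p>Changing the value of <code>model_name</code> is enough to point the same stage definition at a different customer's data.</p>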
<p>Thanks for the great question @omarelb!</p>
<hr>
<p><img src="https://media.giphy.com/media/26u4lOMA8JKSnL9Uk/giphy.gif" alt="My Work Is Done Reaction GIF by SpongeBob SquarePants"></p>
<p>At our November Office Hours Meetup we will be going over internal Kaggle
competitions and PyTorch Lightning integration.
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/281355245/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a>
to stay up to date with specifics as we get closer to the event!</p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and
CML questions answered!</p>https://dvc.org/blog/october-21-heartbeathttps://dvc.org/blog/october-21-heartbeatFri, 15 Oct 2021 00:00:00 GMT<h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>This month we have been flooded with content from our Community. We are grateful
and inspired to keep serving you!</p>
<p><img src="https://media.giphy.com/media/xUA7aN1MTCZx97V1Ic/giphy.gif" alt="Thank you!"></p>
<h2 id="ricardo-manhães-savii-trying-to-turn-machine-learning-into-value" style="position:relative;">Ricardo Manhães Savii: Trying to turn Machine Learning into value<a href="#ricardo-manh%C3%A3es-savii-trying-to-turn-machine-learning-into-value" aria-label="ricardo manhães savii trying to turn machine learning into value permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>If we can't turn machine learning into value, what good are we?
<a href="https://www.linkedin.com/in/ricardoms/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ricardo Manhães Savii</strong></a>
<a href="https://medium.com/@ricardosavii/trying-to-turn-machine-learning-into-value-de9f28cde056" target="_blank" rel="nofollow noopener noreferrer">wrote a piece on Medium</a>
where he tackles how to technically and visually define the steps to deliver an
Intelligent System with the same level of best practice maturity that software
development has today. He combines and synthesizes the ideas of some of the best
known thinkers in the space to build a thorough architecture of machine learning
best practices. Don't miss this post, and take some time to wrap your head
around these diagrams!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 596.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bd6a61c1bdb9d432121f1a603588a9fa/39600/manhaes.png" alt="CI/CD for Machine Learning" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Ricardo
Manhães Savii's Addendum to
<a href="https://medium.com/@francois.chollet" target="_blank" rel="nofollow noopener noreferrer">François Chollet's</a> figure on the result of machine
learning
(<a href="https://medium.com/@ricardosavii/trying-to-turn-machine-learning-into-value-de9f28cde056" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="rappibank-how-to-build-an-efficient-machine-learning-project-workflow" style="position:relative;">RappiBank: How to build an efficient machine learning project workflow<a href="#rappibank-how-to-build-an-efficient-machine-learning-project-workflow" aria-label="rappibank how to build an efficient machine learning project workflow permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Continuing the theme of ML workflow complexity,
<a href="https://www.linkedin.com/in/data-box-science/" target="_blank" rel="nofollow noopener noreferrer"><strong>Daniel Baena</strong></a> wrote a
<a href="https://medium.com/rappibank/how-to-build-an-efficient-machine-learning-project-workflow-using-data-version-control-dvc-aaeaa9cfb79b" target="_blank" rel="nofollow noopener noreferrer">great overview and tutorial piece</a>
outlining the challenges that his team at
<a href="https://bank.rappi.com.br/" target="_blank" rel="nofollow noopener noreferrer">RappiBank</a> encountered and solved with
DVC, including:</p>
<ul>
<li>confusing experiment files with different names</li>
<li>disjointed messaging about training and models and dataset changes</li>
<li>progress that lives only in your head or your own notes, invisible to the
rest of the team</li>
<li>heavy run and re-run times without a modularized system</li>
</ul>
<p>Daniel shows how all of these things can be solved using DVC.🏆</p>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/rappibank/how-to-build-an-efficient-machine-learning-project-workflow-using-data-version-control-dvc-aaeaa9cfb79b" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">How to Build an Efficient Machine Learning Project Workflow Using Data Version Control (DVC)</h4>
<div class="elp-description">Daniel Baena's overview of common MLOps challenges encountered at RappiBank and how they are solved with DVC.</div>
<div class="elp-link">https://medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-10-15/baena-f06654520af4066465ffd7982e0b0fea.jpeg" alt="How to Build an Efficient Machine Learning Project Workflow Using Data Version Control (DVC)">
</div>
</a>
</section>
<p></p>
<h2 id="dagshub-production-oriented-work" style="position:relative;">DAGsHub: Production Oriented Work<a href="#dagshub-production-oriented-work" aria-label="dagshub production oriented work permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Next up, <a href="https://twitter.com/barazida" target="_blank" rel="nofollow noopener noreferrer"><strong>Nir Barazida</strong></a> from
<a href="https://dagshub.com/" target="_blank" rel="nofollow noopener noreferrer">DAGsHub</a>
<a href="https://dagshub.com/docs/workshops/production_oriented_work/?utm_content=bufferef4d6&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer" target="_blank" rel="nofollow noopener noreferrer">created a video</a>
on Production-oriented work using a monorepo strategy and focusing on moving
from research to production-ready code using Git and DVC. If you are a data
scientist trying to wrap your head around going from your notebook to
production, this may help!</p>
<p>
</p><section class="elp-content-holder">
<a href="https://dagshub.com/docs/workshops/production_oriented_work/?utm_content=bufferef4d6&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Production-Oriented Work with Git, DVC and DAGsHub</h4>
<div class="elp-description">Nir Barazida's tutorial and video on how to use a monorepo strategy and go from your notebook to production-ready code.</div>
<div class="elp-link">https://dagshub.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-10-15/dagshub-aa036fbcd9874d7c399ca6ef36cfc846.jpg" alt="Production-Oriented Work with Git, DVC and DAGsHub">
</div>
</a>
</section>
<p></p>
<h2 id="ml-data-versioning-with-dvc-how-to-manage-machine-learning-data" style="position:relative;">ML Data Versioning with DVC: How to Manage Machine Learning Data<a href="#ml-data-versioning-with-dvc-how-to-manage-machine-learning-data" aria-label="ml data versioning with dvc how to manage machine learning data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/piotr-storo%C5%BCenko-438087128/" target="_blank" rel="nofollow noopener noreferrer"><strong>Piotr Storożenko</strong></a>
of <a href="https://appsilon.com/" target="_blank" rel="nofollow noopener noreferrer">Appsilon</a> wrote
<a href="https://appsilon.com/ml-data-versioning-with-dvc/" target="_blank" rel="nofollow noopener noreferrer">a great tutorial</a> taking
into account the many challenges data scientists and ML engineers encounter in
their data versioning efforts and how DVC solves them. Do these scenarios from
his article look familiar?</p>
<blockquote>
<p>Was it in <code>model_3final.pth</code> or <code>model_last.pth</code> that I used a bigger learning
rate?</p>
<p>When did I start using data preprocessing, during <code>model_2a.pth</code> or
<code>model_2aa.pth</code>?</p>
<p>Is <code>model_7.pth</code> trained on the new dataset or on the old one?</p>
<p>Oh, gosh, which set of parameters and data have I used to train <code>model_2.pth</code>?
It was pretty good in the end…</p>
</blockquote>
<h1 id="learning-opportunities" style="position:relative;">Learning Opportunities<a href="#learning-opportunities" aria-label="learning opportunities permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="raviraja-gantas-10-week-course-on-basic-mlops" style="position:relative;">Raviraja Ganta's 10-week course on Basic MLOps<a href="#raviraja-gantas-10-week-course-on-basic-mlops" aria-label="raviraja gantas 10 week course on basic mlops permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Twitter and LinkedIn were ablaze in the last month when
<a href="https://www.linkedin.com/in/ravirajag/" target="_blank" rel="nofollow noopener noreferrer"><strong>Raviraja Ganta</strong></a> announced his
<a href="https://www.ravirajag.dev/blog/mlops-summary" target="_blank" rel="nofollow noopener noreferrer">10-Week Course</a> on MLOps basics.
This course is chock full of resources and practical tutorials to build your
MLOps platform and knowledge. <a href="https://www.ravirajag.dev/blog/mlops-dvc" target="_blank" rel="nofollow noopener noreferrer">Week 3</a>
of the course is about DVC and its ability to solve your versioning and
reproducibility challenges. Be sure to check out
<a href="https://github.com/graviraja/MLOps-Basics" target="_blank" rel="nofollow noopener noreferrer">the course repo</a> as well!</p>
<p><a href="https://mlops.community/" target="_blank" rel="nofollow noopener noreferrer"><strong>MLOps Community</strong></a> is hosting him to speak about
his course on October 20th.
<a href="https://airtable.com/shrh5eGdEbcBsdEdq" target="_blank" rel="nofollow noopener noreferrer">Sign up to attend here!</a></p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/3cf1063b4d2bf22102e5a1e310032794/39600/ganta.png" alt="Raviraja Ganta's 10-Week MLOps Course" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Raviraja Ganta's 10-Week Course on MLOps Basics
(<a href="https://www.ravirajag.dev/blog/mlops-summary" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p>
<h2 id="josh-willis-video-on-covid-simulations-with-dvc" style="position:relative;">Josh Wills' video on COVID simulations with DVC<a href="#josh-willis-video-on-covid-simulations-with-dvc" aria-label="josh willis video on covid simulations with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This week,
<a href="https://twitter.com/josh_wills/status/1441456258746249216" target="_blank" rel="nofollow noopener noreferrer">this Tweet comment</a>
led me to
<a href="https://mlconf.com/sessions/the-covid-scenario-pipeline-high-stakes-data-science/" target="_blank" rel="nofollow noopener noreferrer">this work</a>
by <a href="https://twitter.com/josh_wills" target="_blank" rel="nofollow noopener noreferrer"><strong>Josh Wills</strong></a>. Josh was tapped by
<a href="https://twitter.com/dpatil" target="_blank" rel="nofollow noopener noreferrer"><strong>DJ Patil</strong></a> to participate in some COVID
simulation research early on in the pandemic in which he used DVC. In his
presentation about the project, he tells of the tools he used and challenges of
the use case. Nice DVC shout out at 19:56! Ah, the fruits of a Twitter 🐇🕳!</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/tu7N8M-jwPU?rel=0&%3B=&%3Bshowinfo=0%3B&start=10" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="september-office-hours-video-transfer-learning-with-milecia-mcgregor" style="position:relative;">September Office Hours Video: Transfer Learning with Milecia McGregor<a href="#september-office-hours-video-transfer-learning-with-milecia-mcgregor" aria-label="september office hours video transfer learning with milecia mcgregor permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>If you missed last month's Office Hours
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/" target="_blank" rel="nofollow noopener noreferrer">Meetup</a>, you can now
catch the video! <a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia's</strong></a> presentation
was based on <a href="https://dvc.org/blog/transfer-learning-experiments" target="_blank" rel="nofollow noopener noreferrer">her blog post</a>
on the same topic: Using Experiments for Transfer Learning. If you're curious
about transfer learning in general, AlexNet and SqueezeNet in particular, or
using DVC experiments and checkpoints to track all that you do, this video's for
you!</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/RmJbyQ36zVk?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="quoc-tien-au-continuously-learning-on-the-job-as-a-data-scientist" style="position:relative;">Quoc-Tien Au: Continuously Learning on the Job as a Data Scientist<a href="#quoc-tien-au-continuously-learning-on-the-job-as-a-data-scientist" aria-label="quoc tien au continuously learning on the job as a data scientist permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://towardsdatascience.com/the-what-where-and-how-about-continuously-learning-on-the-job-as-a-data-scientist-b0a31ea4ac48" target="_blank" rel="nofollow noopener noreferrer">This Towards Data Science</a>
article by <a href="https://www.linkedin.com/in/quoctienau/" target="_blank" rel="nofollow noopener noreferrer"><strong>Quoc-Tien Au</strong></a> entitled
"The What, Where, and How about continuously learning on the job as a data
scientist," speaks to some higher points on the need to have a mindset for
continuous learning in the Data Science field. It's packed with great thought
processes and resources on what to learn, where to learn, and how to keep
learning while still getting your work done. Who struggles with this? 😅</p>
<p><img src="https://media.giphy.com/media/icJCVO3GPDbCvvfgpf/giphy.gif" alt="Thats Me I Am GIF by Ryn Dean"></p>
<h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="amsterdam-off-site" style="position:relative;">Amsterdam Off-site<a href="#amsterdam-off-site" aria-label="amsterdam off site permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Most of our team members from Europe got together in Amsterdam recently for a
couple days of brainstorming and team bonding. They went on a Treasure Hunt, ate
Ramen (a favorite among our team) and had great discussions on how to make our
tools and our team even better! Pictured below from front of the room left,
going clockwise (to the back of the room and back up) are David Ortega, Helio
Machado, David de la Iglesia Castro, Laurens Duijvesteijn, Ruslan Kupriev
(hidden), Dmitry Petrov, Jelle Bouwman, Batuhan Taskaya, Svetlana Sachkovskaya,
and Paweł Redzyński.</p>
<p>Be sure to check out this section next month as our Americas team members will
meet in San Francisco!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/24ec369b65ff5da6f58b0ccfe4ec622d/03346/amsterdam.jpg" alt="Europe Iterative Team Members meet in Amsterdam" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Iterative Team Members meet in Amsterdam
(<a href="https://www.linkedin.com/in/gortegadavid/" target="_blank" rel="nofollow noopener noreferrer">Source: David Ortega</a>)</em></p>
<h2 id="new-team-members" style="position:relative;">New Team Members<a href="#new-team-members" aria-label="new team members permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/jordanwweber/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jordan Weber</strong></a> joins us from Los
Angeles, California as our new Chief of Staff. She has previously held similar
roles at venture capital and FinTech firms. In Jordan's free time she enjoys
cooking, tennis, dance, and hiking! 🎾</p>
<p><a href="https://www.linkedin.com/in/kenthom/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ken Thom</strong></a> joins us from Palo Alto,
California as our new Director of Operations. His past work includes business
operations, product management, software and hardware development. In his spare
time he likes to spend time with his family, swim, ski, and hike! 🥾</p>
<p><a href="https://www.linkedin.com/in/jon-burdo-59730a83/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jon Burdo</strong></a> joins us from
Boston, Massachusetts as a Senior Software Engineer. He's been working for the
past few years as a machine learning engineer with a focus on NLP. In his last
role he used DVC and loved it, which is how he eventually ended up here! 🎉 In
his spare time, Jon likes learning about open source software, tinkering with
Linux, and inline skating.</p>
<p><a href="https://www.linkedin.com/in/stephroy1/" target="_blank" rel="nofollow noopener noreferrer"><strong>Stephanie Roy</strong></a> joins the team as a
Senior Software Engineer from Quebec, Canada. Our first Canadian team member!
She has previously worked at LogMeIn on one of their mobile apps. In her spare
time she likes taking care of her plants in her indoor grow house, playing
roller derby, and discovering new things to watch, listen to and eat! 😋</p>
<p>Welcome to all our new team members! We are so glad you are here! 🙌🏼</p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>And wouldn't you know it? We're still hiring!
<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a>
to find details of all the positions including:</p>
<ul>
<li>Senior Software Engineer (ML, Labeling, Python)</li>
<li>Senior Software Engineer (ML, DevTools, Python)</li>
<li>Field Data Scientist / Sales Engineer</li>
<li>Developer Advocate (ML)</li>
<li>Director / VP of Engineering (ML, DevTools)</li>
<li>Director / VP of Product (ML, Data Infra, SaaS)</li>
<li>Head of Talent</li>
<li>Head of DevRel</li>
</ul>
<p>Please pass this info on to anyone you know that may fit the bill. We look
forward to new team members! 🎉</p>
<p><img src="https://media.giphy.com/media/120jXUxrHF5QJ2/giphy.gif" alt="High Five Amy Poehler GIF"></p>
<h2 id="docs-updates" style="position:relative;">Docs Updates<a href="#docs-updates" aria-label="docs updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Here are a few important docs updates you may want to take a look at this month!</p>
<h3 id="-pytorch-lightning" style="position:relative;">📖 PyTorch Lightning<a href="#-pytorch-lightning" aria-label=" pytorch lightning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We all have
<a href="https://www.linkedin.com/search/results/all/?keywords=ilia%20sirotkin&origin=RICH_QUERY_SUGGESTION&position=0&searchId=e7bb3154-797a-44a5-a209-90ffece95246&sid=GeC" target="_blank" rel="nofollow noopener noreferrer"><strong>Ilia Sirotkin</strong></a>
to thank for his contribution to our docs. He created the
<a href="https://dvc.org/doc/dvclive/api-reference/ml-frameworks/pytorch-lightning" target="_blank" rel="nofollow noopener noreferrer">PyTorch Lightning integration docs</a>
for all to use!</p>
<h3 id="-cml-with-dvc-guide" style="position:relative;">📖 CML with DVC guide:<a href="#-cml-with-dvc-guide" aria-label=" cml with dvc guide permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://cml.dev/doc/cml-with-dvc" target="_blank" rel="nofollow noopener noreferrer">Our updated CML with DVC Guide</a> provides
revised code and streamlined information on cloud storage provider credentials
and GitHub Actions setup.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">name</span><span class="token punctuation">:</span> CML & DVC
<span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>push<span class="token punctuation">]</span>
<span class="token key atrule">jobs</span><span class="token punctuation">:</span>
  <span class="token key atrule">run</span><span class="token punctuation">:</span>
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> ubuntu<span class="token punctuation">-</span>latest
    <span class="token key atrule">container</span><span class="token punctuation">:</span> docker<span class="token punctuation">:</span>//ghcr.io/iterative/cml<span class="token punctuation">:</span>0<span class="token punctuation">-</span>dvc2<span class="token punctuation">-</span>base1
    <span class="token key atrule">steps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
        <span class="token key atrule">with</span><span class="token punctuation">:</span>
          <span class="token key atrule">fetch-depth</span><span class="token punctuation">:</span> <span class="token number">0</span>
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Train model
        <span class="token key atrule">env</span><span class="token punctuation">:</span>
          <span class="token key atrule">AWS_ACCESS_KEY_ID</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_ACCESS_KEY_ID <span class="token punctuation">}</span><span class="token punctuation">}</span>
          <span class="token key atrule">AWS_SECRET_ACCESS_KEY</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_SECRET_ACCESS_KEY <span class="token punctuation">}</span><span class="token punctuation">}</span>
        <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
          pip install -r requirements.txt # Install dependencies
          dvc pull data --run-cache # Pull data & run-cache from S3
          dvc repro # Reproduce pipeline</span>
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Create CML report
        <span class="token key atrule">env</span><span class="token punctuation">:</span>
          <span class="token key atrule">REPO_TOKEN</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.GITHUB_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span>
        <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
          echo "## Metrics" >> report.md
          dvc metrics diff master --show-md >> report.md</span>
          <span class="token comment"># Publish confusion matrix diff</span>
          echo "<span class="token comment">## Plots" >> report.md</span>
          echo "<span class="token comment">### Class confusions" >> report.md</span>
          dvc plots diff \
            <span class="token punctuation">-</span><span class="token punctuation">-</span>target classes.csv \
            <span class="token punctuation">-</span><span class="token punctuation">-</span>template confusion \
            <span class="token punctuation">-</span>x actual \
            <span class="token punctuation">-</span>y predicted \
            <span class="token punctuation">-</span><span class="token punctuation">-</span>show<span class="token punctuation">-</span>vega master <span class="token punctuation">></span> vega.json
          vl2png vega.json <span class="token punctuation">-</span>s 1.5 <span class="token punctuation">></span> plot.png
          cml publish <span class="token punctuation">-</span><span class="token punctuation">-</span>md plot.png <span class="token punctuation">></span><span class="token punctuation">></span> report.md
          <span class="token comment"># Publish regularization function diff</span>
          echo "<span class="token comment">### Effects of regularization" >> report.md</span>
          dvc plots diff \
            <span class="token punctuation">-</span><span class="token punctuation">-</span>target estimators.csv \
            <span class="token punctuation">-</span>x Regularization \
            <span class="token punctuation">-</span><span class="token punctuation">-</span>show<span class="token punctuation">-</span>vega master <span class="token punctuation">></span> vega.json
          vl2png vega.json <span class="token punctuation">-</span>s 1.5 <span class="token punctuation">></span> plot.png
          cml publish <span class="token punctuation">-</span><span class="token punctuation">-</span>md plot.png <span class="token punctuation">></span><span class="token punctuation">></span> report.md
          cml send<span class="token punctuation">-</span>comment report.md</code></pre></div>
<h3 id="-shtab" style="position:relative;">📖 Shtab<a href="#-shtab" aria-label=" shtab permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Team member <a href="https://cdcl.ml" target="_blank" rel="nofollow noopener noreferrer"><strong>Casper da Costa-Luis</strong></a> has
<a href="https://docs.iterative.ai/shtab/" target="_blank" rel="nofollow noopener noreferrer">created a docs website</a> for his Python
tab-completion script generator project <a href="https://github.com/iterative/shtab" target="_blank" rel="nofollow noopener noreferrer">shtab</a>.
For more info, check out
<a href="https://dvc.org/blog/shtab-completion-release" target="_blank" rel="nofollow noopener noreferrer">the original blog post</a> about it
as well.</p>
<h2 id="next-meetups" style="position:relative;">Next Meetups<a href="#next-meetups" aria-label="next meetups permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>For the second class of
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/280814336/" target="_blank" rel="nofollow noopener noreferrer">DVC Learn,</a>
join us to learn about getting started running experiments! This lesson will
include information on how to use our
<a href="https://dvc.org/doc/user-guide/experiment-management/checkpoints" target="_blank" rel="nofollow noopener noreferrer">checkpoints</a>
feature as well. We look forward to seeing you there!</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/280814336/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">DVC Learn - Getting Started with Running Experiments</h4>
<div class="elp-description">Milecia McGregor shows us how to get started with DVC Experiments and Checkpoints</div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-10-15/dvc_learn-2c4f8bdab833cb821b246bc5a7d0e118.png" alt="DVC Learn - Getting Started with Running Experiments">
</div>
</a>
</section>
<p></p>
<p>Be sure to join us at the
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/281355245/" target="_blank" rel="nofollow noopener noreferrer">November Office Hours Meetup,</a>
where <a href="https://www.linkedin.com/in/maykon-schots/" target="_blank" rel="nofollow noopener noreferrer"><strong>Maykon Schots</strong></a> will talk
about how he used DVC and CML to create an internal Kaggle competition that
helped his team arrive at their best models in their work for the largest bank
in Brazil.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/281355245/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">DVC Office Hours - Creating an Internal Kaggle Competition with DVC and CML</h4>
<div class="elp-description">Maykon Schots shows us how he used DVC and CML to create an internal Kaggle competition for his team</div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-10-15/office-hours-meetup-409c5ab48d208e9a9cdc6871fd4c0937.png" alt="DVC Office Hours - Creating an Internal Kaggle Competition with DVC and CML">
</div>
</a>
</section>
<p></p>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This month, it was exceedingly hard to pick just one Tweet. I'm leaving you with
one that ballooned our followers over the last month. But there have been many!
I encourage you to visit our newly created
<a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer"><em>Wall of Love ❤️</em></a>
to see all the beautiful Iterative tool love. 🛠❤️🤗</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Startups I'm *incredibly* bullish about: <a href="https://twitter.com/stripe">@Stripe</a>, <a href="https://twitter.com/Iterativeai">@IterativeAI</a>, <a href="https://twitter.com/huggingface">@HuggingFace</a>, and <a href="https://twitter.com/explosion_ai">@Explosion_AI</a>.<br><br>If you're an engineer/PM considering a career change (and it's that time of the year again, no? 😆)—but want to opt away from FAAMG, definitely consider one of the companies above.</p>— 👩💻 Paige Bailey (@DynamicWebPaige) <a href="https://twitter.com/DynamicWebPaige/status/1435256826375720964">September 7, 2021</a></blockquote>
<hr>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/adding-data-to-build-a-more-generic-modelhttps://dvc.org/blog/adding-data-to-build-a-more-generic-modelTue, 05 Oct 2021 00:00:00 GMT<h2 id="intro" style="position:relative;">Intro<a href="#intro" aria-label="intro permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>You might be in the middle of training a model and then the business problem
shifts. Now you have this model that has been going through the training process
with a specific dataset and you need to make the model more generic.</p>
<p>There's likely something that your model learned that can be useful on this new
dataset, so you might not have to restart the entire training process. We'll do
an example of updating a pre-trained model to use a broader dataset with DVC. By
the end of this, you should see how you can handle this quickly and start
running new experiments to get a more generic model.</p>
<h2 id="the-original-pre-trained-model" style="position:relative;">The original pre-trained model<a href="#the-original-pre-trained-model" aria-label="the original pre trained model permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>For this post, we'll be making a more generic image classifier by taking the
original dataset with bees and ants and adding cats and dogs to it. You can
clone <a href="https://github.com/iterative/pretrained-model-demo" target="_blank" rel="nofollow noopener noreferrer">this GitHub repo</a> to
get the current bees and ants model and check out
<a href="https://dvc.org/blog/transfer-learning-experiments" target="_blank" rel="nofollow noopener noreferrer">this post</a> on how we
experimented with both AlexNet and SqueezeNet to build this model.</p>
<p>So we're starting from our current bees and ants model and extending it to
classify dogs and cats as well. We'll start by adding some cats and dogs data to
our validation data and do some experiments with the current model to see how it
performs on generic data.</p>
<p>Then we'll add the cats and dogs data to the training data and watch how the
model improves as we run experiments.</p>
<h2 id="updating-the-dataset-with-dvc" style="position:relative;">Updating the dataset with DVC<a href="#updating-the-dataset-with-dvc" aria-label="updating the dataset with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>To add the new cats and dogs dataset to the project, we'll use this DVC command.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> https://github.com/iterative/dataset-registry blog/cats-dogs</span></code></pre></div>
<p>This downloads a sample dataset with images of cats and dogs. You can use this
command to download files or directories that are tracked by DVC or Git. This
command can be used from anywhere in the file system, as long as DVC is
installed.</p>
<p>The command creates a new directory called <code>./cats-dogs/data/</code>, downloaded
from the DVC remote, that holds the images of cats and dogs. Now we can
gradually add the new data to the existing data.</p>
<p>We'll start by moving the <code>val</code> data for <code>cats</code> and <code>dogs</code> from the
<code>./cats-dogs/data/</code> directory to the corresponding directory in
<code>data/hymenoptera_data</code>.</p>
<p><em>Just a quick note, cats and dogs don't really belong in the <code>hymenoptera</code>
directory since that's specific to ants and bees, but it's the easiest and
fastest way to add the data for this tutorial.</em></p>
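<p>If you'd rather script the move than drag folders around, here is a minimal
Python sketch. It assumes the downloaded data is laid out as
<code>cats-dogs/data/&lt;split&gt;/&lt;class&gt;</code> (for example <code>cats-dogs/data/val/cats</code>), which
you should confirm against your own download. The helper name
<code>move_class_dirs</code> is made up for this example and is not part of DVC.</p>

```python
import shutil
from pathlib import Path


def move_class_dirs(src_root, dst_root, split, classes):
    """Move src_root/<split>/<class> to dst_root/<split>/<class> for each class.

    The <split>/<class> layout is an assumption about the downloaded dataset.
    """
    moved = []
    for name in classes:
        src = Path(src_root) / split / name
        dst = Path(dst_root) / split / name
        if not src.is_dir():
            raise FileNotFoundError(f"expected a directory at {src}")
        if dst.exists():
            # Refuse to merge into an existing class directory by accident.
            raise FileExistsError(f"{dst} already exists")
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        moved.append(dst)
    return moved


# Example call for the validation data (run from the project root):
# move_class_dirs("cats-dogs/data", "data/hymenoptera_data", "val", ["cats", "dogs"])
```

<p>The same helper works later for the training data by passing <code>"train"</code> as the
split and a single class at a time.</p>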
<p>With this new data in place, we can start training our model.</p>
<h2 id="running-new-experiments-with-generic-data" style="position:relative;">Running new experiments with generic data<a href="#running-new-experiments-with-generic-data" aria-label="running new experiments with generic data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>With the updated data, let's run an experiment on the model and see how good the
results are. To run a new experiment, open your terminal and make sure you have
a virtual environment enabled. Then run this command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span></span></code></pre></div>
<p>Once the training epochs are finished, run the following command.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-timestamp</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--include-metrics</span> step,acc,val_acc,loss,val_loss <span class="token punctuation">\</span>
<span class="token parameter variable">--include-params</span> lr,momentum</span></code></pre></div>
<p>The <code>--no-timestamp</code> option hides the timestamps from the table. The <code>--include-metrics</code>
option lets us choose which metrics we want to show in the table. The
<code>--include-params</code> option does the same for hyperparameters. This gives us a
table that's easier to read quickly.</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span>
</span> ────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>3<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.86885<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.46<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.31573<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>3.7067<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span>
<span class="token bold"><span class="token hide">**</span>data-change<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span>
│ ╓ 3b3a2a2 [exp-23593] 3 0.86885 0.46 0.31573 3.7067 0.001 0.09
│ ╟ 93d015d 2 0.83197 0.41333 0.36851 3.4259 0.001 0.09
│ ╟ d474c42 1 0.79918 0.43333 0.46612 3.286 0.001 0.09
├─╨ 1582b4b 0 0.52869 0.39 0.94102 2.5967 0.001 0.09
</span> ────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div>
<p>You'll notice that the validation accuracy is really low. That's because the
training metrics are based on bees and ants while the validation metrics are
based on bees, ants, cats, and dogs. If we looked at the validation metrics by
class, they'd likely be better for bees and ants than cats and dogs.</p>
<p>That means we should probably add more data to the training dataset.</p>
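<p>To check that hunch, you could break validation accuracy down by class. The
sketch below is plain Python and is not part of the demo repo's training
script. The function name <code>per_class_accuracy</code> and the toy labels in the
comment are invented for illustration.</p>

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Return {class_label: fraction of that class predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


# Toy example: a model that only knows ants and bees can never get
# a cat or a dog right, which drags the overall accuracy down.
# per_class_accuracy(
#     ["ants", "bees", "cats", "dogs"],
#     ["ants", "bees", "ants", "bees"],
# )
```

<p>Feeding it the true and predicted labels collected during the validation loop
would show which classes are responsible for the low overall score.</p>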
<h2 id="adding-the-cats-data-to-the-training-dataset" style="position:relative;">Adding the cats data to the training dataset<a href="#adding-the-cats-data-to-the-training-dataset" aria-label="adding the cats data to the training dataset permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Let's add the <code>train</code> data for <code>cats</code> to the corresponding directory in
<code>data/hymenoptera_data</code> and run another experiment with this new data. One
important thing to note here is that we're using checkpoints in our
experiments. That's how we get the metrics for each training epoch.</p>
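<p>As a rough mental model of checkpointing, the sketch below saves state after
every epoch and, by default, resumes from the last saved epoch. It is a
conceptual illustration only, not how DVC or the demo's training script
implements checkpoints. The <code>train</code> function, the JSON checkpoint file, and the
<code>reset</code> flag are all hypothetical.</p>

```python
import json
from pathlib import Path


def train(checkpoint_path, epochs, reset=False):
    """Run `epochs` more epochs, resuming from the checkpoint unless reset=True."""
    path = Path(checkpoint_path)
    if reset and path.exists():
        path.unlink()  # discard earlier progress and start from epoch 0

    if path.exists():
        state = json.loads(path.read_text())
    else:
        state = {"epoch": 0, "history": []}

    start = state["epoch"]
    for epoch in range(start, start + epochs):
        # Placeholder for one real training epoch and its metrics.
        state["history"].append({"step": epoch})
        state["epoch"] = epoch + 1
        path.write_text(json.dumps(state))  # one checkpoint per epoch
    return state
```

<p>Calling <code>train</code> twice continues where the first call stopped, while passing
<code>reset=True</code> starts counting from step 0 again.</p>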
<p>If we want to run a fresh experiment that doesn't resume training from the last
epoch, we need to reset our experiment. That's what we're going to do with this
command.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--reset</span></span></code></pre></div>
<p>This will reset all of the existing checkpoints and execute the training script.
Once it's finished, let's take a look at the metrics table with this command.
It's the same as the one we ran last time.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-timestamp</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--include-metrics</span> step,acc,val_acc,loss,val_loss <span class="token punctuation">\</span>
<span class="token parameter variable">--include-params</span> lr,momentum</span></code></pre></div>
<p>Now you'll have a table that shows both experiments and you can see how much
better the new one did with the <code>cats</code> data added.</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span>
</span> ────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>3<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.91389<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.87<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.20506<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.66306<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span>
<span class="token bold"><span class="token hide">**</span>data-change<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span>
│ ╓ 9405575 [exp-54e8a] 3 0.91389 0.87 0.20506 0.66306 0.001 0.09
│ ╟ 856d80f 2 0.90215 0.87333 0.27204 0.61631 0.001 0.09
│ ╟ 23dc98f 1 0.87671 0.86 0.35964 0.61713 0.001 0.09
├─╨ 99a3c34 0 0.71429 0.82 0.67674 0.62798 0.001 0.09
│ ╓ 3b3a2a2 [exp-23593] 3 0.86885 0.46 0.31573 3.7067 0.001 0.09
│ ╟ 93d015d 2 0.83197 0.41333 0.36851 3.4259 0.001 0.09
│ ╟ d474c42 1 0.79918 0.43333 0.46612 3.286 0.001 0.09
├─╨ 1582b4b 0 0.52869 0.39 0.94102 2.5967 0.001 0.09
</span> ────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div>
<p>There's another way you can look at the difference between the model before we
added the <code>cats</code> data and after. If you run this in your terminal, you'll get a
plot comparing the two experiments.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> exp-23593 exp-54e8a</span></code></pre></div>
<p>The <code>exp-23593</code> and <code>exp-54e8a</code> values are the IDs of the experiments you want
to compare. The command generates a new file in the <code>dvc_plots</code> directory in
your project. That's where you'll find the <code>index.html</code> file you should open in
your browser. You'll see something similar to this.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/dc01547ad9f771f11e39ba81d45658b3/39600/with-cats-data.png" alt="plots comparing the accuracy, validation accuracy, loss, and validation loss for all epochs of each experiment" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>There's a huge difference in the accuracy of our model after we've added this
additional data. Let's see if we can make it even better by adding the <code>dogs</code>
data.</p>
<h2 id="adding-the-dogs-data-to-the-training-dataset" style="position:relative;">Adding the dogs data to the training dataset<a href="#adding-the-dogs-data-to-the-training-dataset" aria-label="adding the dogs data to the training dataset permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We'll add the <code>train</code> data for <code>dogs</code> to the corresponding directory in
<code>data/hymenoptera_data</code> just like we did for the <code>cats</code> data. Now we can run a
new experiment with all of the new data included. We'll still need to reset the
experiment like before, so run the following command.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--reset</span></span></code></pre></div>
<p>Once the training epochs are finished, we can take one more look at that metrics
table.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-timestamp</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--include-metrics</span> step,acc,val_acc,loss,val_loss <span class="token punctuation">\</span>
<span class="token parameter variable">--include-params</span> lr,momentum</span></code></pre></div>
<p>Now we'll have all three experiments to compare.</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span>
</span> ────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>3<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.8795<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.90667<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.29302<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.25752<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span>
<span class="token bold"><span class="token hide">**</span>data-change<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span>
│ ╓ c20220f [exp-82f70] 3 0.8795 0.90667 0.29302 0.25752 0.001 0.09
│ ╟ fcb5a0b 2 0.85915 0.92333 0.38274 0.25257 0.001 0.09
│ ╟ 3768821 1 0.80751 0.84667 0.47681 0.40228 0.001 0.09
├─╨ 7e1b8fb 0 0.64632 0.84 0.87301 0.46744 0.001 0.09
│ ╓ 9405575 [exp-54e8a] 3 0.91389 0.87 0.20506 0.66306 0.001 0.09
│ ╟ 856d80f 2 0.90215 0.87333 0.27204 0.61631 0.001 0.09
│ ╟ 23dc98f 1 0.87671 0.86 0.35964 0.61713 0.001 0.09
├─╨ 99a3c34 0 0.71429 0.82 0.67674 0.62798 0.001 0.09
│ ╓ 3b3a2a2 [exp-23593] 3 0.86885 0.46 0.31573 3.7067 0.001 0.09
│ ╟ 93d015d 2 0.83197 0.41333 0.36851 3.4259 0.001 0.09
│ ╟ d474c42 1 0.79918 0.43333 0.46612 3.286 0.001 0.09
├─╨ 1582b4b 0 0.52869 0.39 0.94102 2.5967 0.001 0.09
</span> ────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div>
<p>These results make sense for the experiments we've run. We're paying attention
to the validation accuracy here because this gives us a fair comparison of
what's happening as we add more data.</p>
<p>The first experiment's training metrics are for bees and ants. The second
experiment's training metrics are for bees, ants, and cats. And the third
experiment's training metrics are for all four classes. So we can't really
compare these metrics.</p>
<p>We can look at a comparison between the experiments with the <code>cats</code> data and
both the <code>cats</code> and <code>dogs</code> data.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> exp-23593 exp-54e8a exp-82f70</span></code></pre></div>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/5dc719adacedff151914e4fb5b634557/39600/with-cats-and-dogs-data.png" alt="plot of differences between model with just cats data and model with both cats and dogs data" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>The results you see line up with what is expected for the validation metrics
based on how we added the data to the training set. Now you can keep running
experiments until you get your model tuned like you need it!</p>
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>When you want to change datasets quickly and start tracking how they affect
your model, using a DVC remote makes it easy to do so on different computers. You'll
be able to quickly upload and download GBs of data and see how changes affect
individual experiments.</p>
<p>If you need help with anything DVC or CML, make sure to
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">join our Discord community</a>! We're always
answering questions and having good conversations with everybody that shows up.</p>https://dvc.org/blog/september-21-community-gemshttps://dvc.org/blog/september-21-community-gemsThu, 30 Sep 2021 00:00:00 GMT<h3 id="is-there-a-way-to-share-data-across-multiple-on-premise-machines-so-that-users-can-train-models-individually" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/879718738163826698" target="_blank" rel="nofollow noopener noreferrer">Is there a way to share data across multiple on-premise machines so that users can train models individually?</a><a href="#is-there-a-way-to-share-data-across-multiple-on-premise-machines-so-that-users-can-train-models-individually" aria-label="is there a way to share data across multiple on premise machines so that users can train models individually permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is a good scenario to try out one of these use cases:</p>
<ul>
<li><a href="https://dvc.org/doc/user-guide/how-to/share-a-dvc-cache" target="_blank" rel="nofollow noopener noreferrer">Configuring a DVC cache</a></li>
<li><a href="https://dvc.org/doc/use-cases/fast-data-caching-hub" target="_blank" rel="nofollow noopener noreferrer">Sharing a development server</a></li>
</ul>
<p>You can have a single storage location mounted on each workstation to serve as a
central cache.</p>
<p>That way all of your machine learning engineers can work with the same data in a
central location.</p>
<p>Thanks for the question @fchpriani!</p>
<h3 id="if-we-change-the-remote-we-are-using-in-our-workspace-does-that-effect-where-dvc-pulls-and-pushes-data-to-for-all-historical-commits" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/882951655979622400" target="_blank" rel="nofollow noopener noreferrer">If we change the remote we are using in our workspace, does that effect where DVC pulls and pushes data to for all historical commits?</a><a href="#if-we-change-the-remote-we-are-using-in-our-workspace-does-that-effect-where-dvc-pulls-and-pushes-data-to-for-all-historical-commits" aria-label="if we change the remote we are using in our workspace does that effect where dvc pulls and pushes data to for all historical commits permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for bringing this up @mattlbeck!</p>
<p>Right now DVC just uses whichever remote is configured in a respective commit
that you've checked out.</p>
<p>To clarify things a bit more, if you run <code>dvc push/pull</code> in a workspace with a
new remote, that new remote will be used for <code>--all-branches</code>, <code>--all-tags</code>, and
<code>--all-commits</code>.</p>
<h3 id="is-there-a-command-to-execute-only-a-few-specific-stages-in-a-dvc-pipeline" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/888054401640562698" target="_blank" rel="nofollow noopener noreferrer">Is there a command to execute only a few specific stages in a DVC pipeline?</a><a href="#is-there-a-command-to-execute-only-a-few-specific-stages-in-a-dvc-pipeline" aria-label="is there a command to execute only a few specific stages in a dvc pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You can freeze the stages that you do not want to be executed.</p>
<p><a href="https://dvc.org/doc/command-reference/freeze"><code>dvc freeze</code></a> and <a href="https://dvc.org/doc/command-reference/unfreeze"><code>dvc unfreeze</code></a> help you do this. Or you can use
<a href="https://dvc.org/doc/command-reference/repro#--glob"><code>dvc repro --glob pattern*</code></a> together with <code>-s</code> to match the stages you want to
run.</p>
<p>Thanks for the question @LucZ!</p>
<h3 id="when-running-queued-experiments-is-it-expected-for-dvc-to-run-dvc-checkout-for-each-experiment" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/883144885417431081" target="_blank" rel="nofollow noopener noreferrer">When running queued experiments, is it expected for DVC to run <code>dvc checkout</code> for each experiment?</a><a href="#when-running-queued-experiments-is-it-expected-for-dvc-to-run-dvc-checkout-for-each-experiment" aria-label="when running queued experiments is it expected for dvc to run dvc checkout for each experiment permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This brings up a good point, so thanks @dmh!</p>
<p>If you usually run experiments with <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>, you'll notice that it doesn't
check out any files. That's because the experiment is running in the current
workspace.</p>
<p>When you use <a href="https://dvc.org/doc/command-reference/exp/run#--queue"><code>dvc exp run --queue</code></a> or <a href="https://dvc.org/doc/command-reference/exp/run#--run-all"><code>dvc exp run --run-all</code></a>, it runs each
experiment in its own separate temp workspace, so files have to be checked out
into those workspaces. Check out the notes in
<a href="https://dvc.org/doc/command-reference/exp/run#queueing-and-parallel-execution" target="_blank" rel="nofollow noopener noreferrer">this reference doc on queueing and parallel execution</a>
for more details.</p>
<h3 id="when-working-with-a-data-registry-is-it-possible-to-pull-a-specific-project-folder-modify-it-then-push-git-changes-and-dvc-push-to-the-remote-storage-without-pulling-data-from-all-the-directories" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/887427010044002345" target="_blank" rel="nofollow noopener noreferrer">When working with a data registry, is it possible to pull a specific project folder, modify it, then push Git changes and <code>dvc push</code> to the remote storage without pulling data from all the directories?</a><a href="#when-working-with-a-data-registry-is-it-possible-to-pull-a-specific-project-folder-modify-it-then-push-git-changes-and-dvc-push-to-the-remote-storage-without-pulling-data-from-all-the-directories" aria-label="when working with a data registry is it possible to pull a specific project folder modify it then push git changes and dvc push to the remote storage without pulling data from all the directories permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is definitely possible. The most common way to handle this is by working in
the specific folder. You can <a href="https://dvc.org/doc/command-reference/pull#-R"><code>dvc pull -R</code></a> from the sub-directory, then make
your changes in the sub-directory, and <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> the changes. Then you can do a
<code>git commit</code> and <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> to manage those changes.</p>
<p>You can also use a Git sub-repo and a DVC sub-repo to do this if each folder
holds a distinct project. Run <code>git init</code> and <a href="https://dvc.org/doc/command-reference/init"><code>dvc init</code></a> in the project folders;
then you can pull them down, modify them, commit, and push the commits back.</p>
<p>Really good question @ross.tsenov!</p>
<h3 id="is-it-possible-to-auto-generate-reports-with-metrics-and-plots-by-running-dvc-in-a-cml-job-when-the-data-is-stored-in-aws-bucket-instead-of-github" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/877072469188575262" target="_blank" rel="nofollow noopener noreferrer">Is it possible to auto-generate reports with metrics and plots by running DVC in a CML job when the data is stored in AWS bucket instead of GitHub?</a><a href="#is-it-possible-to-auto-generate-reports-with-metrics-and-plots-by-running-dvc-in-a-cml-job-when-the-data-is-stored-in-aws-bucket-instead-of-github" aria-label="is it possible to auto generate reports with metrics and plots by running dvc in a cml job when the data is stored in aws bucket instead of github permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for asking @Masmoudi!</p>
<p>When you need to retrieve data, you can run <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> to get it from the S3
bucket. If you run into an error with this, try adding a
<code>uses: iterative/setup-dvc@v1</code> step to the workflow before the step that runs <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a>. The error could happen
because the default CML action doesn't install DVC.</p>
<p>If you want more details on how CML works in GitHub, check out
<a href="https://cml.dev/doc/start/github#the-cml-github-action" target="_blank" rel="nofollow noopener noreferrer">the docs</a>!</p>
<h3 id="what-mechanism-can-i-use-in-gitlab-to-trigger-a-ci-pipeline-periodically-so-that-models-get-re-trained-and-logged-to-dvc-automatically" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/887306645883990037" target="_blank" rel="nofollow noopener noreferrer">What mechanism can I use in GitLab to trigger a CI pipeline periodically so that models get re-trained and logged to DVC automatically?</a><a href="#what-mechanism-can-i-use-in-gitlab-to-trigger-a-ci-pipeline-periodically-so-that-models-get-re-trained-and-logged-to-dvc-automatically" aria-label="what mechanism can i use in gitlab to trigger a ci pipeline periodically so that models get re trained and logged to dvc automatically permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You can use
<a href="https://docs.gitlab.com/ee/ci/pipelines/schedules.html" target="_blank" rel="nofollow noopener noreferrer">pipeline schedules</a> to
train your model periodically and <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> the results.</p>
<p>Good question @mihaj!</p>
<hr>
<p><img src="https://media.giphy.com/media/8UF0EXzsc0Ckg/giphy.gif" alt="Its Over GIF"></p>
<p>At our October Office Hours Meetup we will be going over how to get started with
data version control.
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/280814318/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a>
to stay up to date with specifics as we get closer to the event!</p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and
CML questions answered!</p>https://dvc.org/blog/refactorhttps://dvc.org/blog/refactorFri, 24 Sep 2021 00:00:00 GMT<p>It is common for big codebases to grow to a complexity where it is nearly
impossible for someone to tediously and flawlessly refactor things manually
everywhere. The main problem with existing automated solutions (such as
regex-based find-and-replace tools) is that they treat source code like a plain
text document. This often results in false positives (tools making changes when
they shouldn't) and/or false negatives (not changing what they should). This is
primarily due to a lack of ability to truly encapsulate structural concepts of
the programming language: syntax and grammar that are impossible to manifest in
regexes.</p>
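<p>A minimal sketch of that failure mode, using only Python's standard <code>re</code> module. The sample source string and the variable names below are made up for illustration; the goal is to rename the variable <code>foo</code> to <code>bar</code>.</p>

```python
import re

# Hypothetical snippet to refactor: rename the variable `foo` to `bar`.
source = 'foo_2 = ["a", *foo]\nassert secrets.get("foo") == foo\n'

# Plain substring replacement also rewrites `foo_2` and the string "foo".
naive = re.sub(r"foo", "bar", source)

# Word boundaries spare `foo_2`, but the regex still cannot tell that
# "foo" is a string literal rather than a variable.
bounded = re.sub(r"\bfoo\b", "bar", source)

print(naive)
print(bounded)
```

<p>The word-boundary pattern removes the false positive on <code>foo_2</code>, yet both attempts still rewrite the string literal, because a regex has no notion of what a string or an identifier is.</p>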
<p>This is where <a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree" target="_blank" rel="nofollow noopener noreferrer">AST</a>s shine.
They are the common building blocks of source code; produced by a parser that
actually understands the language's syntax and creates a tree object where
smaller parts (e.g. tokens) are ordered in a way that they are related by their
syntactical meanings.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">password <span class="token operator">=</span> <span class="token builtin">input</span><span class="token punctuation">(</span><span class="token string">"password? "</span><span class="token punctuation">)</span>
<span class="token keyword">if</span> password <span class="token operator">==</span> secrets<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"my_password"</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"correct"</span><span class="token punctuation">)</span>
<span class="token keyword">else</span><span class="token punctuation">:</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"incorrect"</span><span class="token punctuation">)</span></code></pre></div>
<p>For example, the AST for the code above will look like this:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 592.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/6ec34840bbe994244ffbd74bb2b65984/39600/ast.png" alt="Abstract syntax tree of the example code" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Abstract Syntax
Tree</em></p>
<p>The top-most "root" node of this tree represents a single Python file. Each file
consists of a number of statements (e.g. function definitions, loops, etc.). For
our example we have only 2 statements: an assignment (to <code>password</code>), and an
<code>if</code> statement. Each of these statements in turn has child nodes as defined by
<a href="https://docs.python.org/3/library/ast.html#abstract-grammar" target="_blank" rel="nofollow noopener noreferrer">Python's ASDL</a>.</p>
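<p>As a quick check of that structure, here is a small sketch that parses the same snippet with Python's built-in <code>ast</code> module and lists the top-level statements. The variable names are illustrative only; <code>secrets</code> is never imported because the code is parsed, not executed.</p>

```python
import ast

# The example from above, kept as a string so it is only parsed, never run.
source = '''
password = input("password? ")
if password == secrets.get("my_password"):
    print("correct")
else:
    print("incorrect")
'''

tree = ast.parse(source)

# The root node is a Module; its `body` holds the top-level statements.
root_type = type(tree).__name__
top_level = [type(stmt).__name__ for stmt in tree.body]
print(root_type, top_level)

# Each statement carries its own children, e.g. the `if` has a test
# expression plus a body and an else branch.
if_node = tree.body[1]
print(type(if_node.test).__name__, len(if_node.body), len(if_node.orelse))
```

<p>Calling <code>ast.dump(tree, indent=2)</code> (Python 3.9+) prints the full tree if you want to compare it with the figure.</p>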
<h2 id="refactoring-source-code-through-asts" style="position:relative;">Refactoring source code through ASTs<a href="#refactoring-source-code-through-asts" aria-label="refactoring source code through asts permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://github.com/isidentical/refactor" target="_blank" rel="nofollow noopener noreferrer">Refactor</a> simplifies the process of
matching ASTs. It then applies your transformations to these ASTs without
touching the other parts of your source code.</p>
<p>For example, consider this code:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">foo <span class="token operator">=</span> <span class="token punctuation">[</span>
    <span class="token number">1</span><span class="token punctuation">,</span>
    <span class="token number">2</span>
<span class="token punctuation">]</span>
foo_2 <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token string">'a'</span><span class="token punctuation">,</span> <span class="token operator">*</span>foo<span class="token punctuation">]</span>
<span class="token keyword">if</span> foo<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token operator">>=</span> <span class="token number">1</span><span class="token punctuation">:</span>
    <span class="token keyword">assert</span> secrets<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"foo"</span><span class="token punctuation">)</span> <span class="token operator">==</span> foo</code></pre></div>
<p>As a simple example, let's try to find and replace all instances of the <code>foo</code>
variable with <code>bar</code>… but without changing things inside strings or partial
matches like <code>foo_2</code>.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> ast
<span class="token keyword">import</span> refactor</code></pre></div>
<p>The first thing we need to do is define a rule. Each rule is a class that
defines a single entrypoint (<code>match()</code>), takes AST nodes from the tree, and
either rejects them (by raising an <code>AssertionError</code> or just returning <code>None</code>)
or accepts them (by returning a <code>refactor.Action</code>).</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">class</span> <span class="token class-name">ReplaceFoo</span><span class="token punctuation">(</span>refactor<span class="token punctuation">.</span>Rule<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">def</span> <span class="token function">match</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> node<span class="token punctuation">)</span><span class="token punctuation">:</span></code></pre></div>
<p>Next, in the <code>match()</code> method, we look for <code>Name</code> nodes (which wrap the
actual identifier) and check whether the node's <code>id</code> is <code>foo</code>.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">        <span class="token keyword">assert</span> <span class="token builtin">isinstance</span><span class="token punctuation">(</span>node<span class="token punctuation">,</span> ast<span class="token punctuation">.</span>Name<span class="token punctuation">)</span>
        <span class="token keyword">assert</span> node<span class="token punctuation">.</span><span class="token builtin">id</span> <span class="token operator">==</span> <span class="token string">"foo"</span></code></pre></div>
<p>If any of these assertions fail, the function will terminate and the engine will
move to the next <code>node</code> in the tree. But if we have a match, we need to return
some sort of an action. The simplest thing we can return is a
<code>refactor.ReplacementAction</code> which takes this <code>node</code> and replaces it with the
given argument.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">        <span class="token keyword">return</span> refactor<span class="token punctuation">.</span>ReplacementAction<span class="token punctuation">(</span>
            node<span class="token punctuation">,</span>
            ast<span class="token punctuation">.</span>Name<span class="token punctuation">(</span><span class="token string">"bar"</span><span class="token punctuation">,</span> node<span class="token punctuation">.</span>ctx<span class="token punctuation">)</span>
        <span class="token punctuation">)</span></code></pre></div>
<p>And that's it! To run this refactoring, we can simply create a CLI application
from our rules via <code>refactor.run()</code>:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">"__main__"</span><span class="token punctuation">:</span>
    refactor<span class="token punctuation">.</span>run<span class="token punctuation">(</span>rules<span class="token operator">=</span><span class="token punctuation">[</span>ReplaceFoo<span class="token punctuation">]</span><span class="token punctuation">)</span></code></pre></div>
<p>If we run it on the file above, we will get this <code>diff</code>:</p>
<div class="gatsby-highlight" data-language="diff"><pre class="language-diff"><code class="language-diff"><span class="token coord">@@ -1,9 +1,9 @@</span>
<span class="token deleted-sign deleted"><span class="token prefix deleted">-</span>foo = [
</span><span class="token inserted-sign inserted"><span class="token prefix inserted">+</span>bar = [
</span><span class="token unchanged"><span class="token prefix unchanged"> </span> 1,
<span class="token prefix unchanged"> </span> 2
<span class="token prefix unchanged"> </span>]
</span>
<span class="token deleted-sign deleted"><span class="token prefix deleted">-</span>foo_2 = ['a', *foo]
</span><span class="token inserted-sign inserted"><span class="token prefix inserted">+</span>foo_2 = ['a', *bar]
</span>
<span class="token deleted-sign deleted"><span class="token prefix deleted">-</span>if foo[0] >= 1:
<span class="token prefix deleted">-</span> assert secrets.get("foo") == foo
</span><span class="token inserted-sign inserted"><span class="token prefix inserted">+</span>if bar[0] >= 1:
<span class="token prefix inserted">+</span> assert secrets.get("foo") == bar</span></code></pre></div>
<p>All instances of the <code>foo</code> variable have been replaced, but items like <code>foo_2</code>
and <code>"foo"</code> are left alone as expected!</p>
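<p>To see why matching on <code>Name</code> nodes is enough, here is a rough standard-library sketch of the same rename using <code>ast.NodeTransformer</code>. This is not how <code>refactor</code> works internally: it assumes Python 3.9+ for <code>ast.unparse</code>, and unlike <code>refactor</code> it regenerates the whole file, so the original formatting and comments are lost. The class name <code>RenameFoo</code> is made up for this example.</p>

```python
import ast

SOURCE = '''
foo = [1, 2]
foo_2 = ['a', *foo]
if foo[0] >= 1:
    assert secrets.get("foo") == foo
'''


class RenameFoo(ast.NodeTransformer):
    """Replace every Name node whose id is exactly `foo` with `bar`."""

    def visit_Name(self, node):
        if node.id == "foo":
            # Keep the original context (load/store) and source position.
            return ast.copy_location(ast.Name(id="bar", ctx=node.ctx), node)
        return node


new_tree = RenameFoo().visit(ast.parse(SOURCE))
result = ast.unparse(ast.fix_missing_locations(new_tree))
print(result)
```

<p>The string <code>"foo"</code> is a <code>Constant</code> node and <code>foo_2</code> is a <code>Name</code> with a different <code>id</code>, so neither is touched. The price of this shortcut is the reformatted output, which is the problem <code>refactor</code> avoids by only rewriting the matched parts of the source.</p>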
<h2 id="going-deeper" style="position:relative;">Going Deeper<a href="#going-deeper" aria-label="going deeper permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Obviously not all refactorings are as simple as this, so <code>refactor</code> is equipped
with more features like different actions, observers, and representatives for
managing context. If you are curious about these and more advanced features, be
sure to check out the
<a href="https://refactor.readthedocs.io/en/latest" target="_blank" rel="nofollow noopener noreferrer"><code>refactor</code> documentation</a>!</p>https://dvc.org/blog/september-21-dvc-heartbeathttps://dvc.org/blog/september-21-dvc-heartbeatTue, 14 Sep 2021 00:00:00 GMT<h1 id="this-months-head-turning-news-from-the-community" style="position:relative;">This month's head-turning News from the Community!<a href="#this-months-head-turning-news-from-the-community" aria-label="this months head turning news from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><img src="https://media.giphy.com/media/1hWHUCgi3wKT6/giphy.gif?cid=ecf05e47a5sz6kvyp4h1swih08yokkbdfr39pq9pxscg975u&rid=giphy.gif&ct=g" alt="Head Turning Content from the DVC Community!"></p>
<h3 id="tezan-sahus-4-part-blog-series" style="position:relative;">Tezan Sahu's 4-part blog series<a href="#tezan-sahus-4-part-blog-series" aria-label="tezan sahus 4 part blog series permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Welcome to September! We'll kick off this month's Community picks with a
four-part series by <a href="https://twitter.com/SahuTezan" target="_blank" rel="nofollow noopener noreferrer"><strong>Tezan Sahu</strong></a> on the
<a href="https://tezansahu.medium.com/fundamentals-of-mlops-part-1-a-gentle-introduction-to-mlops-1b184d2c32a8" target="_blank" rel="nofollow noopener noreferrer"><strong>Fundamentals of MLOps.</strong></a>
Tezan introduces readers to the core ideas behind MLOps: how the best practices
of DevOps are being adapted to machine learning projects that deploy
large-scale, AI-powered applications. The series includes:</p>
<ul>
<li>Part 1:
<a href="https://tezansahu.medium.com/fundamentals-of-mlops-part-1-a-gentle-introduction-to-mlops-1b184d2c32a8" target="_blank" rel="nofollow noopener noreferrer">A Gentle Introduction to MLOps</a></li>
<li>Part 2:
<a href="https://tezansahu.medium.com/fundamentals-of-mlops-part-2-data-model-management-with-dvc-6be2ad284ec4" target="_blank" rel="nofollow noopener noreferrer">Data & Model Management with DVC</a>
We love this part best! ❤️😉</li>
<li>Part 3:
<a href="https://tezansahu.medium.com/fundamentals-of-mlops-part-3-ml-experimentation-using-pycaret-747f14e4c28d" target="_blank" rel="nofollow noopener noreferrer">ML Experimentation with PyCaret</a></li>
<li>Part 4:
<a href="https://tezansahu.medium.com/fundamentals-of-mlops-part-4-tracking-with-mlflow-deployment-with-fastapi-61614115436" target="_blank" rel="nofollow noopener noreferrer">Tracking with MLflow & Deployment with FastAPI</a></li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 468px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f7a737bfd6e8b7a186ba2775d773d571/39600/tezan-sahu.png" alt="Fundamentals of MLOps" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Tezan
Sahu's 4 part series on the Fundamentals of MLOps
<a href="https://ljvmiranda921.github.io/notebook/2021/07/30/data-centric-ml/" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p>
<p>If you follow the steps through this series, you will learn how to build and
deploy an end-to-end ML project - all the steps leading to production!</p>
<h3 id="miguel-méndez-tutorial-on-dvc--mmdetection" style="position:relative;">Miguel Méndez' Tutorial on DVC + MMdetection<a href="#miguel-m%C3%A9ndez-tutorial-on-dvc--mmdetection" aria-label="miguel méndez tutorial on dvc mmdetection permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This month <a href="https://www.linkedin.com/in/miguel-mendez/" target="_blank" rel="nofollow noopener noreferrer">Miguel Méndez</a> of
<a href="https://www.gradiant.org/en//" target="_blank" rel="nofollow noopener noreferrer">Gradiant</a> brings us a guide on object detection
using the <a href="">MMdetection</a> framework in conjunction with DVC to design the
pipeline, version models and monitor training progress. This follows his
<a href="https://mmeendez8.github.io/2021/07/01/dvc-tutorial.html" target="_blank" rel="nofollow noopener noreferrer">first guide</a> covering
how to version your datasets with DVC, which we shared in the
<a href="https://dvc.org/blog/july-21-dvc-heartbeat" target="_blank" rel="nofollow noopener noreferrer">July Heartbeat.</a></p>
<p>In
<a href="https://mmeendez8.github.io/2021/08/30/mmdet-dvc-tutorial.html" target="_blank" rel="nofollow noopener noreferrer">this new guide,</a>
you'll gain a thorough understanding of the steps, have access to
<a href="https://github.com/mmeendez8/mmdetection_dvc" target="_blank" rel="nofollow noopener noreferrer">his repo</a> for the project, and
find his thoughts on scaling hyperparameter tuning through this
<a href="https://github.com/iterative/dvc/issues/5477#issuecomment-905440724" target="_blank" rel="nofollow noopener noreferrer">open issue</a>
about experiments that we are trying to resolve. Join the conversation! We'd
love your input!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 468px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4fac790212019b4e53a1735ed91feb92/39600/mmdetection.png" alt="DVC + MMdetection" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Miguel
Méndez' second guide in a series using DVC in an object detection project
<a href="https://mmeendez8.github.io/2021/08/30/mmdet-dvc-tutorial.html" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p>
<h2 id="hrittik-roys-complete-intro-to-dvc" style="position:relative;">Hrittik Roy's Complete Intro to DVC<a href="#hrittik-roys-complete-intro-to-dvc" aria-label="hrittik roys complete intro to dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>It was just a few short months ago when <a href="">Hrittik Roy</a> joined us at his first
<a href="">DVC Office Hours</a>. Now he's written
<a href="https://dev.to/hrittikhere/dvc-git-for-data-a-complete-intro-4626" target="_blank" rel="nofollow noopener noreferrer">DVC (Git for Data): A Complete Tutorial</a>
on DVC and how it solves the challenges of ML engineers. In this piece he takes
you through setup, pipelines and versioning, experiments, and sharing through our
built-in shared caching, so that you and your teammates can reduce resource use
when focusing on a subset of datasets as you move through your project.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 468px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d38ded232422aedc9d34369951b99b33/39600/hrittik-roy.png" alt="DVC (Git for Data): A complete Intro" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Hrittik Roy's Complete Intro on DVC
<a href="https://dev.to/hrittikhere/dvc-git-for-data-a-complete-intro-4626" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p>
<h2 id="andrey-kurenkovs-curated-list-of-ai-newsletters" style="position:relative;">Andrey Kurenkov's curated list of AI Newsletters<a href="#andrey-kurenkovs-curated-list-of-ai-newsletters" aria-label="andrey kurenkovs curated list of ai newsletters permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In case you missed it,
<a href="https://twitter.com/andrey_kurenkov?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor" target="_blank" rel="nofollow noopener noreferrer">Andrey Kurenkov</a>
tweeted that he finally got around to writing about his list of 21 favorite AI
Newsletters. You can find the article
<a href="https://medium.com/@andreykurenkov/the-best-ai-newsletters-483dc75134b" target="_blank" rel="nofollow noopener noreferrer">right here.</a>
Be sure to check it out and get reading…</p>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/@andreykurenkov/the-best-ai-newsletters-483dc75134b" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">One PhD student’s curated list of 21 newsletters to help you keep up with AI news and research</h4>
<div class="elp-description">Andrey Kurenkov's curated list of the best AI newsletters</div>
<div class="elp-link">https://medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-09-14/andrey-wordcloud-021ffe734cdcce52fa574effb88fb851.png" alt="One PhD student’s curated list of 21 newsletters to help you keep up with AI news and research">
</div>
</a>
</section>
<p></p>
<h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>We know there were a lot of peeps out on holiday over the last month so let me
fill you in!</p>
<p><img src="https://media.giphy.com/media/lz7212bWGdZbkm30KJ/giphy.gif?cid=ecf05e47hg6at9zmqb1pglypfrzi6vrgdsbay6zgza7wmwwu&rid=giphy.gif&ct=g" alt="Grab the popcorn!"></p>
<h2 id="yes-thats-right-vs-code-extension-is-coming" style="position:relative;">Yes, that's right, VS Code extension is coming!<a href="#yes-thats-right-vs-code-extension-is-coming" aria-label="yes thats right vs code extension is coming permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://twitter.com/DynamicWebPaige" target="_blank" rel="nofollow noopener noreferrer">Paige Bailey</a> let the cat out of the bag
<a href="https://twitter.com/DynamicWebPaige/status/1430920240251035649?s=20" target="_blank" rel="nofollow noopener noreferrer">with her tweet</a>
about the development of our VS Code extension for DVC. We're getting closer
every day! If you'd like to be a part of the beta testing (how could you not?)
<a href="https://t.co/F64H9yyDH9?amp=1" target="_blank" rel="nofollow noopener noreferrer">join us here.</a></p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 468px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c04073dc47a6902eb849b9b44ffab032/39600/VSCode.png" alt="VS Code Extension for DVC" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Paige
Bailey lets the cat out of the bag
<a href="https://twitter.com/DynamicWebPaige/status/1430920240251035649?s=20" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p>
<h2 id="-docs-updates" style="position:relative;">📖 Docs Updates<a href="#-docs-updates" aria-label=" docs updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>As promised, we will be adding this section to the Heartbeat each month so that
you can stay in the know about the doc updates that will most impact your
workflows. You won't want to miss these…</p>
<h3 id="-fast-and-secure-data-caching-hub" style="position:relative;">📖 Fast and Secure Data Caching Hub<a href="#-fast-and-secure-data-caching-hub" aria-label=" fast and secure data caching hub permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>First up, a new doc on our
<a href="https://dvc.org/doc/use-cases/fast-data-caching-hub#fast-and-secure-data-caching-hub" target="_blank" rel="nofollow noopener noreferrer">Fast and Secure Data Caching Hub.</a>
Check out this doc to learn how DVC's built-in data caching lets you implement a
simple and efficient storage layer globally - FOR YOUR ENTIRE TEAM. This lets
you:</p>
<ul>
<li>⏱ Speed data transfers from massive object stores currently on the cloud</li>
<li>💰 Pay only for fast access to frequently-used data</li>
<li>🙅🏻♂️ Avoid extra downloads and duplicating data</li>
<li>⚡️ Switch data inputs fast (without re-downloading) on a shared server used
for machine learning experiments.</li>
</ul>
<p>Status: Must read. 📖</p>
<p><img src="https://dvc.org/2021-09-14/fcaching-57a95a2297f0fbd38a2625ae0177046b.gif" alt="Fast and Secure Data Caching Hub">
<em>Fast and Secure Data Caching Hub
<a href="https://dvc.org/doc/use-cases/fast-data-caching-hub#fast-and-secure-data-caching-hub" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p>
<h3 id="-cicd-for-machine-learning" style="position:relative;">📖 CI/CD for Machine Learning<a href="#-cicd-for-machine-learning" aria-label=" cicd for machine learning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Is this your life?</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 612px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/101349ee0d2416d578b584e83f12ae55/39600/cicd4ml-0.png" alt="Rage Quit Job" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Is this your life?
<a href="https://dvc.org/doc/use-cases/ci-cd-for-machine-learning#continuous-integration-and-deployment-for-machine-learning" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p>
<p>Our latest doc,
<a href="https://dvc.org/doc/use-cases/ci-cd-for-machine-learning#continuous-integration-and-deployment-for-machine-learning" target="_blank" rel="nofollow noopener noreferrer">Continuous Integration and Deployment for Machine Learning,</a>
shows you how to move from the above chaos to CI/CD victory through:</p>
<ul>
<li>✅ Data validation</li>
<li>✅ Model validation</li>
<li>🎟 Provisioning</li>
<li>📈 Metrics</li>
</ul>
<p>Read the whole doc to learn how DVC and CML will enable you to run entire
experiments/research online and remove most of your management headaches, so
your workflow looks more like this. 👇🏼</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 561px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1b000d95083794b7ebfb4e8b901881f1/39600/cicd4ml-1.png" alt="Traditional ML meets CI/CD" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Traditional ML meets CI/CD with DVC and CML
<a href="https://dvc.org/doc/use-cases/ci-cd-for-machine-learning#continuous-integration-and-deployment-for-machine-learning" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p>
<h3 id="-need-to-clean-up-your-worksapce" style="position:relative;">📖 Need to Clean up Your Workspace?<a href="#-need-to-clean-up-your-worksapce" aria-label=" need to clean up your workspace permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://dvc.org/doc/user-guide/experiment-management/cleaning-experiments" target="_blank" rel="nofollow noopener noreferrer">Cleaning Up Experiments</a>
has been made bright and shiny and new to help you do the same with your
experiments. Be sure to check it out!</p>
<h3 id="-hugging-face-integration-with-dvc-live" style="position:relative;">📖 Hugging Face Integration with DVC Live<a href="#-hugging-face-integration-with-dvc-live" aria-label=" hugging face integration with dvc live permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://huggingface.co/" target="_blank" rel="nofollow noopener noreferrer">Hugging Face</a> fans now have an integration with
DVCLive! Check out how to
<a href="https://dvc.org/doc/dvclive/api-reference/ml-frameworks/huggingface" target="_blank" rel="nofollow noopener noreferrer">get set up here!</a>
Thanks <a href="https://github.com/pacifikus" target="_blank" rel="nofollow noopener noreferrer">@pacifikus</a>, for the contribution! 🙏🏼</p>
<h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This Thursday at our
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/280212578/" target="_blank" rel="nofollow noopener noreferrer">September Office Hours Meetup</a>,
<a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer">Milecia McGregor</a> will be presenting her
tutorial on
<a href="https://dvc.org/blog/transfer-learning-experiments" target="_blank" rel="nofollow noopener noreferrer">Using Experiments For Transfer Learning.</a>
Join us on September 16th at 3:00 pm UTC! RSVP at the link below! 👇🏼</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/280212578/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">DVC Office Hours - Using Experiments For Transfer Learning</h4>
<div class="elp-description">Milecia McGregor shows how to use DVC experiment tracking to compare models in a transfer learning project</div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-09-14/pretrained-models-67709ed24c45932295bf5818741399d6.png" alt="DVC Office Hours - Using Experiments For Transfer Learning">
</div>
</a>
</section>
<p></p>
<h2 id="learning-opportunities" style="position:relative;">Learning Opportunities<a href="#learning-opportunities" aria-label="learning opportunities permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Our August Meetup video is out, so if you weren't able to make it, you can catch
all the details on <a href="https://twitter.com/AntoineToubhans" target="_blank" rel="nofollow noopener noreferrer">Antoine Toubhans'</a>
tutorial on
<a href="https://www.sicara.ai/blog/dvc-streamlit-webui-ml" target="_blank" rel="nofollow noopener noreferrer">DVC + Streamlit = ❤️</a></p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/F318uN01v7M?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We'll be introducing some new team members next month, but we are still hiring.
So do check out our open positions
<a href="https://www.notion.so/iterative/iterative-ai-is-hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">here</a>
to find details of all the positions including:</p>
<ul>
<li>Senior Front-End Engineer (TypeScript, Node, React)</li>
<li>Senior Software Engineer (ML, Dev Tools, Python)</li>
<li>Senior Software Engineer (ML, Data Infra, GoLang)</li>
<li>Machine Learning Engineer/Field Data Scientist</li>
<li>Developer Advocate (ML)</li>
<li>Director/VP of Engineering (ML, DevTools)</li>
<li>Director/VP of Product (ML, Data Infra, SaaS)</li>
<li>Director/VP of Operations/Chief of Staff</li>
</ul>
<p>Please pass this info on to anyone you know that may fit the bill. We look
forward to new team members! 🎉</p>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Last week this Tweet brought us another 300 Twitter followers, catapulting us
over 3000! Thanks Community for joining us on this MLOps ride! More to come! 🚀</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Startups I'm *incredibly* bullish about: <a href="https://twitter.com/stripe">@Stripe</a>, <a href="https://twitter.com/Iterativeai">@IterativeAI</a>, <a href="https://twitter.com/huggingface">@HuggingFace</a>, and <a href="https://twitter.com/explosion_ai">@Explosion_AI</a>.<br><br>If you're an engineer/PM considering a career change (and it's that time of the year again, no? 😆)—but want to opt away from FAAMG, definitely consider one of the companies above.</p>— 👩💻 Paige Bailey (@DynamicWebPaige) <a href="https://twitter.com/DynamicWebPaige/status/1435256826375720964">September 7, 2021</a></blockquote>
<hr>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/road-to-hellhttps://dvc.org/blog/road-to-hellTue, 07 Sep 2021 00:00:00 GMT<p>Machine learning operations (MLOps) in the last year has emerged as a distinct
IT discipline for building machine learning (ML) or artificial intelligence (AI)
models. While at first blush that may seem like a viable method for automating
the building of AI models, in reality purveyors of MLOps platforms have a vested
interest in convincing organizations to acquire platforms that exist outside of
best DevOps practices that have already been proven to accelerate application
development.</p>
<p>AI models, however, are ultimately a software artifact like any other that needs
to be integrated within an application. The trouble with MLOps as it is most
often pursued today is data scientists are constructing AI models in almost
complete isolation from the rest of the organization. The hope is that somehow
when the AI model is completed it will magically be incorporated into an
application development workflow. Unfortunately for all concerned, the rate at
which applications are being developed using best DevOps practices rarely aligns
with the rate at which AI models are being constructed.</p>
<blockquote>
<p>"The trouble with MLOps as it is most often pursued today is data scientists
are constructing AI models in almost complete isolation from the rest of the
organization."</p>
</blockquote>
<p>The result is not only a lot of wasted time and effort; the slow rate at which
digital business transformation initiatives that depend on AI models are rolled
out also becomes a significant competitive disadvantage. In effect, the road to
AI hell is paved with good MLOps intentions.</p>
<p>While working as a data scientist at Microsoft, I saw firsthand how machine
learning and AI were traditionally implemented in an isolated fashion. That
unsatisfactory experience led to the launch of the open source Data Version Control
(DVC) and Continuous Machine Learning (CML) tools that integrate ML workflows
into best practices for software development. Instead of creating a separate
proprietary AI platform that needs to be acquired and maintained, the goal needs
to be to extend traditional software tools such as Git, collaboration and
continuous integration/continuous delivery (CI/CD) platforms to meet the needs
of both developers and ML engineers. The entire ML stack needs to be reinvented
in a way that makes it accessible to every developer.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/fd9d4dc199039488512e2fb94d4bd300/39600/dvc-studio.png" alt="DVC Studio" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>DVC and CML are open source tools that now, along with DVC Studio, streamline
the workflow of data scientists. They integrate ML workflows into current
practices for software development in a way that eliminates the need for many
features of proprietary AI platforms such as AWS SageMaker, Microsoft Azure ML
and Google Vertex AI by extending traditional software tools like Git and CI/CD
platforms to meet the needs of ML researchers and ML engineers. In essence, they
provide an open platform based on best DevOps practices to operationalize ML and
AI.</p>
<blockquote>
<p>"DVC and CML are open source tools that streamline the workflow of data
scientists. They integrate ML workflows into current practices for software
development in a way that eliminates the need for many features of proprietary
AI platforms such as AWS SageMaker, Microsoft Azure ML and Google Vertex AI by
extending traditional software tools like Git and CI/CD platforms to meet the
needs of ML researchers and ML engineers."</p>
</blockquote>
<p>MLOps is about operations and automation for ML and AI. It covers the entire
lifecycle of an ML process including labeling data, development, modeling, and
monitoring. Every ML/AI platform offers this functionality. However, our vision
for MLOps is different. We think it should be embedded within your DevOps
processes. It should be part of your engineering infrastructure, engineering
stack and engineering processes. ML requires additional tools. It’s just that those
tools need to be incorporated into a larger toolchain.</p>
<p>The primary reason to do this is to interact more consistently with people from
the software engineering side and to reuse proven tools such as Git,
GitHub/GitLab and CI/CD systems. An ML silo that builds an AI model outside the
traditional application development process creates a divide that needs to be
bridged whenever a data scientist needs to collaborate with engineers. For
example, with a traditional AI platform, all the workflows are predefined. There
may be some opportunity to modify them, but for all intents and purposes, those
workflows are inflexible. That’s the wrong approach. Teams made up of data
scientists and developers should be able to define their own workflow based on
their business requirements and team preferences, just like they do today when
constructing any other software artifact. Rather than a platform forcing teams
to embrace a highly opinionated workflow, they can employ flexible tools such as
Git, GitHub, and their existing CI tools as they see fit.</p>
<blockquote>
<p>"Teams made up of data scientists and developers should be able to define
their own workflow based on their business requirements and team preferences,
just like they do today when constructing any other software artifact."</p>
</blockquote>
<h2 id="how-we-do-it" style="position:relative;">How We Do It<a href="#how-we-do-it" aria-label="how we do it permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>When it comes to software engineering, everything in a workflow is based on the
version of the artifact. However, when working with large data sets, that
approach doesn’t work because there is no data versioning with existing tools.
We extend existing DevOps tools so that developers can version data and ML
models in addition to code.</p>
<p>In addition to allowing for data and model versioning, we also align data
scientists to the CI/CD process. This enables the data scientist to share code
and data with other members of the team in a way that actually works on their
machines! That’s critical because code is typically run through a third-party
platform to determine if it will run in a production environment. There is no
way to bring data into this process, which means there’s no real way to
determine whether a model works before deploying it. There are no ways to show
metrics. There are no ways to compare your metrics with your production metrics.
In this scenario, everything needs to be instrumented to attach required plots
to test. That takes a lot of time. We enable multiple plot points to be tested.
Finally, we provide a place to visualize and analyze data other than employing
Microsoft Excel spreadsheets. We extend traditional software engineering
functionality by providing a better system to visualize data right on top of
your GitHub, GitLab or Bitbucket user interface.</p>
<blockquote>
<p>"We believe an open source-based workflow based on version control and CI
tools will streamline machine learning in the same way software development
has already been modernized."</p>
</blockquote>
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We believe an open source-based workflow based on version control and CI tools
will streamline machine learning in the same way software development has
already been modernized. If data scientists, engineers and developers can
accelerate the development of ML/AI models by reusing files, pipelines,
experiments and even entire models stored in a Git repository, the rate at which
AI will be infused into software will increase by several orders of magnitude
and, best of all, the road to AI hell is not taken.</p>
<hr>
<p><em>This post originally appeared in</em>
<a href="https://thenewstack.io/the-road-to-ai-hell-starts-with-good-mlops-intentions/" target="_blank" rel="nofollow noopener noreferrer">The New Stack.</a></p>https://dvc.org/blog/august-21-community-gemshttps://dvc.org/blog/august-21-community-gemsTue, 31 Aug 2021 00:00:00 GMT<h3 id="q-are-toml-files-supported-for-storing-model-metrics-and-displaying-them-via-dvc-metrics-show" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/865974923079319563" target="_blank" rel="nofollow noopener noreferrer">Q: Are TOML files supported for storing model metrics and displaying them via <code>dvc metrics show</code>?</a><a href="#q-are-toml-files-supported-for-storing-model-metrics-and-displaying-them-via-dvc-metrics-show" aria-label="q are toml files supported for storing model metrics and displaying them via dvc metrics show permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for the question @naeljaneLiblikas!</p>
<p>DVC does not support TOML files for metrics. TOML files are supported for
parameters only at the moment.</p>
<p>We do have an <a href="https://github.com/iterative/dvc/issues/6402" target="_blank" rel="nofollow noopener noreferrer">open issue</a> for
this. Please feel free to add any comments or emojis to this issue so we know
how to prioritize it!</p>
<h3 id="q-is-there-a-way-to-store-the-results-of-the-experiments-table-in-a-csv-file" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/872554861340803092" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to store the results of the experiments table in a CSV file?</a><a href="#q-is-there-a-way-to-store-the-results-of-the-experiments-table-in-a-csv-file" aria-label="q is there a way to store the results of the experiments table in a csv file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Take a look at the <code>--show-json</code> option of <a href="https://dvc.org/doc/command-reference/exp/show"><code>dvc exp show</code></a>. This will print the
table in JSON format and you can write a script to save it to another file.</p>
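<p>As a rough idea of what such a script could look like, here is a minimal
sketch that flattens nested JSON into CSV rows using only the standard library.
The layout of <code>sample</code> is a hypothetical stand-in, and
<code>flatten</code> and <code>experiments_to_csv</code> are illustrative names;
the real JSON printed by DVC is nested differently and varies between versions,
so adapt the traversal to what your version prints.</p>

```python
import csv
import io
import json


def flatten(nested, prefix=""):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def experiments_to_csv(json_text):
    """Convert {experiment_name: {nested values}} JSON text into CSV text."""
    experiments = json.loads(json_text)
    rows = [{"experiment": name, **flatten(data)} for name, data in experiments.items()]
    # Collect column names in first-seen order so the header is stable.
    columns = []
    for row in rows:
        for column in row:
            if column not in columns:
                columns.append(column)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample input; this is NOT the exact layout DVC prints.
sample = json.dumps({
    "exp-a": {"params": {"lr": 0.001}, "metrics": {"acc": 0.9}},
    "exp-b": {"params": {"lr": 0.01}, "metrics": {"acc": 0.8}},
})
csv_text = experiments_to_csv(sample)
```

<p>In practice you would read the JSON text from the command's output (for
example, from a file you redirected it to) instead of <code>sample</code>, and
write <code>csv_text</code> to a <code>.csv</code> file.</p>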
<p>We have an <a href="https://github.com/iterative/dvc/issues/5446" target="_blank" rel="nofollow noopener noreferrer">open feature request</a>
to add CSV support. Give us some feedback so we know how to prioritize this on
our roadmap!</p>
<p>Another workaround you could try uses our Python API. Keep in mind that this
part of the API isn't public and isn't as user-friendly as it could be, but you
can try something like this:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> itertools
<span class="token keyword">import</span> json

<span class="token keyword">import</span> dvc<span class="token punctuation">.</span>api
<span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd
<span class="token keyword">import</span> yaml

<span class="token comment"># Flatten the lists of experiments returned by ls() into one iterable</span>
exps <span class="token operator">=</span> itertools<span class="token punctuation">.</span>chain<span class="token punctuation">.</span>from_iterable<span class="token punctuation">(</span>dvc<span class="token punctuation">.</span>api<span class="token punctuation">.</span>Repo<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>experiments<span class="token punctuation">.</span>ls<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>values<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>

<span class="token keyword">def</span> <span class="token function">get_exp_info</span><span class="token punctuation">(</span>exp<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Merge the params and metrics of one experiment into a single dict</span>
    exp_dict <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token string">"exp"</span><span class="token punctuation">:</span> exp<span class="token punctuation">}</span>
    <span class="token keyword">with</span> dvc<span class="token punctuation">.</span>api<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"params.yaml"</span><span class="token punctuation">,</span> rev<span class="token operator">=</span>exp<span class="token punctuation">)</span> <span class="token keyword">as</span> p<span class="token punctuation">:</span>
        params <span class="token operator">=</span> yaml<span class="token punctuation">.</span>load<span class="token punctuation">(</span>p<span class="token punctuation">,</span> Loader<span class="token operator">=</span>yaml<span class="token punctuation">.</span>Loader<span class="token punctuation">)</span>
        exp_dict<span class="token punctuation">.</span>update<span class="token punctuation">(</span>params<span class="token punctuation">)</span>
    <span class="token keyword">with</span> dvc<span class="token punctuation">.</span>api<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"scores.json"</span><span class="token punctuation">,</span> rev<span class="token operator">=</span>exp<span class="token punctuation">)</span> <span class="token keyword">as</span> s<span class="token punctuation">:</span>
        metrics <span class="token operator">=</span> json<span class="token punctuation">.</span>load<span class="token punctuation">(</span>s<span class="token punctuation">)</span>
        exp_dict<span class="token punctuation">.</span>update<span class="token punctuation">(</span>metrics<span class="token punctuation">)</span>
    <span class="token keyword">return</span> exp_dict

exps_list <span class="token operator">=</span> <span class="token punctuation">[</span>get_exp_info<span class="token punctuation">(</span>exp<span class="token punctuation">)</span> <span class="token keyword">for</span> exp <span class="token keyword">in</span> exps<span class="token punctuation">]</span>
df <span class="token operator">=</span> pd<span class="token punctuation">.</span>DataFrame<span class="token punctuation">.</span>from_records<span class="token punctuation">(</span>exps_list<span class="token punctuation">)</span></code></pre></div>
<p>Great question @Jess_!</p>
<h3 id="q-is-there-a-recommended-way-to-specify-multiple-pipelines-in-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/864230750325047316" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a recommended way to specify multiple pipelines in DVC?</a><a href="#q-is-there-a-recommended-way-to-specify-multiple-pipelines-in-dvc" aria-label="q is there a recommended way to specify multiple pipelines in dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If you want to work with multiple pipelines, we recommend keeping each one in
a separate <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. This is a recommendation, not a requirement for
specifying different pipelines. Here's a bit of explanation:</p>
<ul>
<li>Splitting a <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file into multiple files is encouraged where there are
clear logical groupings between stages. It avoids confusion, improves
readability, and shortens commands by avoiding long paths preceding every
filename.</li>
<li><a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> files can be in any sub-directory or nested sub-directory in the
project structure and DVC will find them.</li>
<li>DVC will process them just as if they were a single file, i.e.
dependencies between stages in different <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> files are still respected.</li>
<li>Each <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file will have its own <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a> file in the same directory.</li>
</ul>
<p>If you want to see the rest of the explanation,
<a href="https://github.com/iterative/dvc.org/issues/2494" target="_blank" rel="nofollow noopener noreferrer">check out this user guide PR we have up</a>.
Please feel free to add a comment or emoji on this PR so we know how to
prioritize this content for you!</p>
<p>Thanks @Tups!</p>
<h3 id="q-is-there-way-to-allow-different-pipelines-to-have-common-dependencies-and-outputs-in-dvc-pipelines" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/867747202306146335" target="_blank" rel="nofollow noopener noreferrer">Q: Is there way to allow different pipelines to have common dependencies and outputs in DVC pipelines?</a><a href="#q-is-there-way-to-allow-different-pipelines-to-have-common-dependencies-and-outputs-in-dvc-pipelines" aria-label="q is there way to allow different pipelines to have common dependencies and outputs in dvc pipelines permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Good question @vgodie!</p>
<p>It is possible to have overlapping dependencies, but not overlapping outputs.
Overlapping outputs introduce uncertainty into DVC commands like
<a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a>.</p>
<p>Sometimes people want to have overlapping directory outputs (different stages
that write many different files to the same directory). Or they might have a
series of stages that append to the same file. In that case, we suggest having
each stage create its own new file and combining the files in a final stage so
they are consistently written in the same order.</p>
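<p>A final "combine" stage along those lines could run a small script like the
sketch below. The function name, the <code>*.txt</code> pattern, and the
directory layout are assumptions made for illustration; the point is only that
the part files are concatenated in a fixed, sorted order.</p>

```python
from pathlib import Path


def combine_parts(parts_dir, combined_path):
    """Concatenate every *.txt part file into one output, in sorted name order."""
    # Sorting by file name keeps the result identical from run to run.
    part_files = sorted(Path(parts_dir).glob("*.txt"))
    # Keep combined_path outside parts_dir so it is never picked up as a part.
    with open(combined_path, "w", encoding="utf-8") as combined:
        for part in part_files:
            combined.write(part.read_text(encoding="utf-8"))
    return [part.name for part in part_files]
```

<p>Each earlier stage would then list its own part file as an output, and only
the final stage would list the combined file, so no two stages share an
output.</p>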
<h3 id="q-how-does-the-cml-runner-restart-workflows-if-its-been-shut-down-by-aws-eg-spot-instances" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/862641924200857660" target="_blank" rel="nofollow noopener noreferrer">Q: How does the CML runner restart workflows if it's been shut down by AWS (e.g. spot instances)?</a><a href="#q-how-does-the-cml-runner-restart-workflows-if-its-been-shut-down-by-aws-eg-spot-instances" aria-label="q how does the cml runner restart workflows if its been shut down by aws eg spot instances permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You shouldn't have to do anything. Spot instances send a <code>SIGINT</code> that we
handle to restart the workflow. We have been supporting graceful shutdown by
unregistering runners for a while now.</p>
<p>The main difference now is that we restart workflows with unfinished jobs.</p>
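<p>To illustrate the general pattern of handling <code>SIGINT</code> for a
graceful shutdown, here is a minimal Python sketch. It is not CML's actual
code: the <code>Runner</code> class, its attributes, and <code>install</code>
are hypothetical names used only for this example.</p>

```python
import signal


class Runner:
    """Hypothetical stand-in for a self-hosted CI runner."""

    def __init__(self, unfinished_jobs):
        self.registered = True
        self.unfinished_jobs = list(unfinished_jobs)
        self.jobs_to_restart = []

    def shutdown(self, signum, frame):
        # Remember which jobs still need to run so they can be restarted,
        # then unregister so no new jobs are sent to this runner.
        self.jobs_to_restart = list(self.unfinished_jobs)
        self.registered = False


def install(runner):
    # Route SIGINT to the runner's shutdown handler.
    # signal.signal() must be called from the main thread.
    signal.signal(signal.SIGINT, runner.shutdown)
```

<p>Calling <code>install(runner)</code> once at startup is enough; when the
process later receives <code>SIGINT</code>, <code>shutdown</code> runs before
the process exits.</p>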
<p>Thanks for such a good question @andee96!</p>
<h3 id="q-can-i-change-an-endpoint-that-is-being-or-does-cml-publish-always-save-the-artifacts-on-this-endpoint" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/864444303169421322" target="_blank" rel="nofollow noopener noreferrer">Q: Can I change an endpoint that is being? Or does <code>cml publish</code> always save the artifacts on this endpoint?</a><a href="#q-can-i-change-an-endpoint-that-is-being-or-does-cml-publish-always-save-the-artifacts-on-this-endpoint" aria-label="q can i change an endpoint that is being or does cml publish always save the artifacts on this endpoint permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Good question @Nwp8nice!</p>
<p>If you use GitLab you can use the <code>--native</code> option to upload to GitLab instead.</p>
<p>It would be nice to be able to offer an alternative link so if you're
interested, a PR for <a href="https://github.com/iterative/cml/issues/291" target="_blank" rel="nofollow noopener noreferrer">this issue</a>
would be awesome! 😊</p>
<h3 id="q-is-cml-used-for-creating-the-mlops-workflows-like-apache-airflow" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/866624571519664128" target="_blank" rel="nofollow noopener noreferrer">Q: Is CML used for creating the MLOps workflows, like Apache Airflow?</a><a href="#q-is-cml-used-for-creating-the-mlops-workflows-like-apache-airflow" aria-label="q is cml used for creating the mlops workflows like apache airflow permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is a really good question @Ravi Kumar!</p>
<p>CML is intended to augment existing CI/CD engines like GitHub Actions or GitLab
CI/CD, not replace them. It's a lightweight wrapper and not a complete
replacement workflow ecosystem like Airflow. We don't like reinventing working
wheels.</p>
<h3 id="q-does-cml-have-the-ability-to-cope-with-long-running-instances-eg-launching-an-aws-instance-via-github-actions-that-lasts-more-than-72-hours" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/866730530262351873" target="_blank" rel="nofollow noopener noreferrer">Q: Does CML have the ability to cope with long-running instances, e.g. launching an AWS instance via GitHub Actions that lasts more than 72 hours?</a><a href="#q-does-cml-have-the-ability-to-cope-with-long-running-instances-eg-launching-an-aws-instance-via-github-actions-that-lasts-more-than-72-hours" aria-label="q does cml have the ability to cope with long running instances eg launching an aws instance via github actions that lasts more than 72 hours permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Once the GitHub Actions limit of 72 hours is reached for self-hosted runners,
CML will handle restarting the Action and reconnecting to the runner. Meanwhile,
on GitLab there is no time limit to circumvent for self-hosted runners.</p>
<p>Thanks @sergechuvakin!</p>
<hr>
<p><img src="https://media.giphy.com/media/l0IycQmt79g9XzOWQ/giphy.gif" alt="Shut It Down GIF by Matt Cutshall"></p>
<p>At our September Office Hours Meetup we will be doing a live demo of running
experiments to fine-tune an existing model to work on a different dataset.
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/279024694/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a>
to stay up to date with specifics as we get closer to the event!</p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and
CML questions answered!</p>https://dvc.org/blog/transfer-learning-experimentshttps://dvc.org/blog/transfer-learning-experimentsTue, 24 Aug 2021 00:00:00 GMT<h2 id="intro" style="position:relative;">Intro<a href="#intro" aria-label="intro permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Plenty of available machine learning models have been trained to solve one
problem, and the knowledge they gained can be applied to a new, related
problem. For example, a model like AlexNet has been trained on millions of
images, so you could potentially use it to classify cars, animals, or even
people. This is called
<a href="https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a" target="_blank" rel="nofollow noopener noreferrer">transfer learning</a>,
and it can save a lot of time compared to developing a model from scratch.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/S3Hm_BPLie0?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>To take advantage of transfer learning, we can use fine-tuning to adapt
the model to our new problem. In many cases, we start by replacing the last
layer of the model. With the AlexNet example, the last layer might previously
have been used to classify cars, while our new problem is classifying animals.</p>
<p>Even though we already have the bulk of the model defined, we'll still have to
do some experimentation to determine whether we need to replace more layers in
the model or if any other changes need to be made.</p>
<p>In this post, we'll go through an example of fine-tuning
<a href="https://towardsdatascience.com/alexnet-the-architecture-that-challenged-cnns-e406d5297951" target="_blank" rel="nofollow noopener noreferrer">AlexNet</a>
and
<a href="https://towardsdatascience.com/review-squeezenet-image-classification-e7414825581a" target="_blank" rel="nofollow noopener noreferrer">SqueezeNet</a>
to classify bees and ants. We'll use DVC to handle experiments for us and we'll
compare the results of both models at the end.</p>
<h2 id="initialize-the-pre-trained-model" style="position:relative;">Initialize the pre-trained model<a href="#initialize-the-pre-trained-model" aria-label="initialize the pre trained model permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We'll be fine-tuning the AlexNet model and the SqueezeNet model to classify
images of bees and ants. You can find the project we're working with in
<a href="https://github.com/iterative/pretrained-model-demo" target="_blank" rel="nofollow noopener noreferrer">this repo</a>, which is based
on the tutorial over at
<a href="https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html" target="_blank" rel="nofollow noopener noreferrer">this post</a>.</p>
<p>In the <code>pretrained_model_tuner.py</code> file, you'll find the code that defines both
the AlexNet and SqueezeNet models. We start by initializing these models so we
can get the number of model features and the input size we need for fine-tuning.</p>
<p>Since the project has everything we need to initialize the models, we can start
training and comparing the differences between them with the ants/bees dataset.
Running experiments to get the best tuning for each model can make it difficult
to see which changes led to a better result. That's why we will be using DVC to
track changes in the code and the data.</p>
<h2 id="adding-the-train-stage" style="position:relative;">Adding the train stage<a href="#adding-the-train-stage" aria-label="adding the train stage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Stages in DVC let us define individual data processes and can be used to build
detailed machine learning pipelines. You have the ability to define the
different steps of model creation like preprocessing, featurization, and
training.</p>
<p>We currently have a <code>train</code> stage in the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. If you take a look at
it, you'll see something like:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">train</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python pretrained_model_tuner.py
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> data/hymenoptera_data
      <span class="token punctuation">-</span> pretrained_model_tuner.py
    <span class="token key atrule">params</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> lr
      <span class="token punctuation">-</span> momentum
      <span class="token punctuation">-</span> model_name
      <span class="token punctuation">-</span> num_classes
      <span class="token punctuation">-</span> batch_size
      <span class="token punctuation">-</span> num_epochs
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">model.pt</span><span class="token punctuation">:</span>
          <span class="token key atrule">checkpoint</span><span class="token punctuation">:</span> <span class="token boolean important">true</span>
    <span class="token key atrule">live</span><span class="token punctuation">:</span>
      <span class="token key atrule">results</span><span class="token punctuation">:</span>
        <span class="token key atrule">summary</span><span class="token punctuation">:</span> <span class="token boolean important">true</span>
        <span class="token key atrule">html</span><span class="token punctuation">:</span> <span class="token boolean important">true</span></code></pre></div>
<p>We need this <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file so DVC knows what to pay attention to
in our workflow. DVC will start managing data, and it will know which metrics to
track and what output to expect from each step.</p>
<p>You'll typically add stages to <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> using the <a href="https://dvc.org/doc/command-reference/stage/add"><code>dvc stage add</code></a> command,
which is one of the ways you can add new stages or update existing ones.</p>
<p>With the <code>train</code> stage defined, let's look at where the metrics actually come
from in the code. If you open <code>pretrained_model_tuner.py</code>, you'll see where
we dump the accuracy and loss for the training epochs into the <code>results.json</code>
file. We also save the model at each epoch and record metrics for
each epoch using <code>dvclive</code> logging.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">if</span> phase <span class="token operator">==</span> <span class="token string">'train'</span><span class="token punctuation">:</span>
    <span class="token comment"># Save the model weights, then log the training metrics for this epoch</span>
    torch<span class="token punctuation">.</span>save<span class="token punctuation">(</span>model<span class="token punctuation">.</span>state_dict<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">"model.pt"</span><span class="token punctuation">)</span>

    dvclive<span class="token punctuation">.</span>log<span class="token punctuation">(</span><span class="token string">'acc'</span><span class="token punctuation">,</span> epoch_acc<span class="token punctuation">.</span>item<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    dvclive<span class="token punctuation">.</span>log<span class="token punctuation">(</span><span class="token string">'loss'</span><span class="token punctuation">,</span> epoch_loss<span class="token punctuation">)</span>
    dvclive<span class="token punctuation">.</span>log<span class="token punctuation">(</span><span class="token string">'training_time'</span><span class="token punctuation">,</span> epoch_time_elapsed<span class="token punctuation">)</span>

<span class="token keyword">if</span> phase <span class="token operator">==</span> <span class="token string">'val'</span><span class="token punctuation">:</span>
    dvclive<span class="token punctuation">.</span>log<span class="token punctuation">(</span><span class="token string">'val_acc'</span><span class="token punctuation">,</span> epoch_acc<span class="token punctuation">.</span>item<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    dvclive<span class="token punctuation">.</span>log<span class="token punctuation">(</span><span class="token string">'val_loss'</span><span class="token punctuation">,</span> epoch_loss<span class="token punctuation">)</span>

    val_acc_history<span class="token punctuation">.</span>append<span class="token punctuation">(</span>epoch_acc<span class="token punctuation">)</span>

    <span class="token comment"># Validation is the last phase of an epoch, so advance the step here</span>
    dvclive<span class="token punctuation">.</span>next_step<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div>
<p>This code is needed to let DVC access the metrics in the project because it will
read the metrics from the summary file that <code>dvclive</code> writes (<code>results.json</code> here).</p>
<p>Since we have several hyperparameters set in the <code>params.yaml</code>, we need to use
those values when we run the training stage. The following code makes the
hyperparameter values accessible in the <code>train</code> function.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> ruamel<span class="token punctuation">.</span>yaml <span class="token keyword">import</span> YAML

<span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"params.yaml"</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span>
    yaml <span class="token operator">=</span> YAML<span class="token punctuation">(</span>typ<span class="token operator">=</span><span class="token string">'safe'</span><span class="token punctuation">)</span>
    params <span class="token operator">=</span> yaml<span class="token punctuation">.</span>load<span class="token punctuation">(</span>f<span class="token punctuation">)</span></code></pre></div>
<p>With all of this in place, we can finally start running experiments to fine-tune
the two models.</p>
<h2 id="fine-tuning-alexnet" style="position:relative;">Fine-tuning AlexNet<a href="#fine-tuning-alexnet" aria-label="fine tuning alexnet permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>You can find the code that initializes the AlexNet model in the
<code>initialize_model</code> function in <code>pretrained_model_tuner.py</code>. Since we have DVC
set up, we can jump straight into fine-tuning this model to see which
hyperparameters give us the best accuracy.</p>
<p>We'll run the first experiment with the following command.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span></span></code></pre></div>
<p>This will execute the <code>pretrained_model_tuner.py</code> script and run for 5 epochs
since that's what we defined in <code>params.yaml</code>. When this finishes, you can check
out the metrics from this run with the current hyperparameter values.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span></span></code></pre></div>
<p>You'll see a table similar to this.</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Created<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>training_time<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>model_name<span 
class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>num_classes<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>batch_size<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>num_epochs<span class="token hide">**</span></span></span>
</span> ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> - <span class="token bold"><span class="token hide">**</span>4<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.92623<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.19567<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>229.18<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.9085<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.25145<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>alexnet<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>2<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>8<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>5<span class="token hide">**</span></span>
<span class="token bold"><span class="token hide">**</span>main<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>01:58 PM<span class="token hide">**</span></span> - - - - - - <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>alexnet<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>2<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>8<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>5<span class="token hide">**</span></span>
│ ╓ bf81637 [exp-a1f53] 02:05 PM 4 0.92623 0.19567 229.18 0.9085 0.25145 0.001 0.09 alexnet 2 8 5
│ ╟ 9ca3fb8 02:04 PM 3 0.89344 0.27423 178.34 0.90196 0.26965 0.001 0.09 alexnet 2 8 5
│ ╟ a34ead1 02:03 PM 2 0.87295 0.29018 127.36 0.9085 0.2796 0.001 0.09 alexnet 2 8 5
│ ╟ ae382c7 02:02 PM 1 0.89754 0.26993 76.419 0.89542 0.31113 0.001 0.09 alexnet 2 8 5
├─╨ a95260d 02:01 PM 0 0.73361 0.5271 25.71 0.86928 0.36408 0.001 0.09 alexnet 2 8 5
</span> ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div>
<p>Now let's update the hyperparameters and run another experiment. There are
several ways to do this with DVC:</p>
<ul>
<li>Change the hyperparameter values directly in <code>params.yaml</code></li>
<li>Update the values using the <code>--set-param</code> or the shorthand <code>-S</code> option on
<a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a></li>
<li>Queue multiple experiments with different values using the <code>--queue</code> option on
<a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a></li>
</ul>
<p>We'll walk through an example of each of these in the rest of this article.</p>
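<p>As a rough mental model for the <code>-S</code> option, the sketch below applies <code>key=value</code> overrides on top of an in-memory parameter dictionary. The <code>apply_overrides</code> helper is hypothetical and exists only for illustration; it is not DVC's implementation, and the parameter values are examples.</p>

```python
import ast


def apply_overrides(params, overrides):
    """Return a copy of params with 'key=value' strings applied on top."""
    updated = dict(params)
    for item in overrides:
        key, _, raw = item.partition("=")
        try:
            # Turn "0.025" into a float and "2" into an int.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything that is not a Python literal stays a string.
            value = raw
        updated[key] = value
    return updated


# Example starting values (hypothetical in-memory copy of params.yaml).
params = {"lr": 0.001, "momentum": 0.09, "model_name": "alexnet", "num_epochs": 5}
new_params = apply_overrides(params, ["lr=0.025", "momentum=0.5", "num_epochs=2"])
print(new_params)
```

<p>The original dictionary is left untouched, which mirrors the idea that each experiment gets its own set of values while the baseline stays available for comparison.</p>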
<p>Let's start by updating the hyperparameter values in <code>params.yaml</code>. After the
edit, your file should contain these values.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">lr</span><span class="token punctuation">:</span> <span class="token number">0.009</span>
<span class="token key atrule">momentum</span><span class="token punctuation">:</span> <span class="token number">0.017</span></code></pre></div>
<p>Now run another experiment with <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a>. To make the table more readable,
we're going to specify the parameters we want to show and take a look at the
metrics with:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-timestamp</span> <span class="token parameter variable">--include-params</span> lr,momentum,model_name</span></code></pre></div>
<p>Your table should look something like this now. Because we're using checkpoints,
this run trains for additional epochs on top of the previous experiment rather
than starting over. You'll see what it takes to start training from scratch later.</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>training_time<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>model_name<span class="token hide">**</span></span></span>
</span> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>9<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.91803<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.27989<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>228.59<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.82353<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.69077<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.009<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.017<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>alexnet<span class="token hide">**</span></span>
<span class="token bold"><span class="token hide">**</span>main<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>alexnet<span class="token hide">**</span></span>
│ ╓ 2361cff [exp-c0b11] 9 0.91803 0.27989 228.59 0.82353 0.69077 0.009 0.017 alexnet
│ ╟ 7686d2f 8 0.90984 0.23496 177.65 0.87582 0.50887 0.009 0.017 alexnet
│ ╟ 671f8cd 7 0.88934 0.39237 126.7 0.86928 0.47856 0.009 0.017 alexnet
│ ╟ ea1bf61 6 0.84836 0.4195 75.834 0.91503 0.30885 0.009 0.017 alexnet
│ ╟ a9f8dab (bf81637) 5 0.79508 0.72891 25.219 0.66667 1.0311 0.009 0.017 alexnet
│ ╓ bf81637 [exp-a1f53] 4 0.92623 0.19567 229.18 0.9085 0.25145 0.001 0.09 alexnet
│ ╟ 9ca3fb8 3 0.89344 0.27423 178.34 0.90196 0.26965 0.001 0.09 alexnet
│ ╟ a34ead1 2 0.87295 0.29018 127.36 0.9085 0.2796 0.001 0.09 alexnet
│ ╟ ae382c7 1 0.89754 0.26993 76.419 0.89542 0.31113 0.001 0.09 alexnet
├─╨ a95260d 0 0.73361 0.5271 25.71 0.86928 0.36408 0.001 0.09 alexnet
</span> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div>
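<p>To see why the <code>step</code> column keeps counting from 5 instead of restarting at 0, here is a minimal sketch of checkpoint-style resumption. The <code>train</code> function and the <code>state</code> dictionary are hypothetical stand-ins, not the code in <code>pretrained_model_tuner.py</code>.</p>

```python
def train(num_epochs, state=None):
    """Run num_epochs more epochs, resuming from state if one is given."""
    # A fresh run starts at step 0; a resumed run picks up after the last step.
    state = dict(state) if state else {"next_step": 0}
    steps = []
    for _ in range(num_epochs):
        steps.append(state["next_step"])
        state["next_step"] += 1
    return steps, state


first_steps, checkpoint = train(5)               # first experiment
second_steps, checkpoint = train(5, checkpoint)  # continues from the checkpoint
fresh_steps, _ = train(2)                        # no checkpoint: starts over
print(first_steps, second_steps, fresh_steps)
```

<p>Calling <code>train</code> without a saved state corresponds to starting from scratch, while passing the previous state corresponds to training on top of the earlier experiment.</p>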
<p>Finding good values for hyperparameters can take a few iterations, even when
you're working with a pretrained model. So we'll run one more experiment to
fine-tune this AlexNet model. This time we'll do it using the <code>-S</code> option.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">lr</span><span class="token operator">=</span><span class="token number">0.025</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">momentum</span><span class="token operator">=</span><span class="token number">0.5</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">num_epochs</span><span class="token operator">=</span><span class="token number">2</span></span></code></pre></div>
<p>The updated table will have values similar to this.</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>training_time<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>model_name<span class="token hide">**</span></span></span>
</span> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>11<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.88525<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>1.1355<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>76.799<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.9085<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>1.7642<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.025<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.5<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>alexnet<span class="token hide">**</span></span>
<span class="token bold"><span class="token hide">**</span>main<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>alexnet<span class="token hide">**</span></span>
│ ╓ 54e87bc [exp-52406] 11 0.88525 1.1355 76.799 0.9085 1.7642 0.025 0.5 alexnet
│ ╟ b2b9ad0 (2361cff) 10 0.79098 2.9427 25.715 0.8366 1.4148 0.025 0.5 alexnet
│ ╓ 2361cff [exp-c0b11] 9 0.91803 0.27989 228.59 0.82353 0.69077 0.009 0.017 alexnet
│ ╟ 7686d2f 8 0.90984 0.23496 177.65 0.87582 0.50887 0.009 0.017 alexnet
│ ╟ 671f8cd 7 0.88934 0.39237 126.7 0.86928 0.47856 0.009 0.017 alexnet
│ ╟ ea1bf61 6 0.84836 0.4195 75.834 0.91503 0.30885 0.009 0.017 alexnet
│ ╟ a9f8dab (bf81637) 5 0.79508 0.72891 25.219 0.66667 1.0311 0.009 0.017 alexnet
│ ╓ bf81637 [exp-a1f53] 4 0.92623 0.19567 229.18 0.9085 0.25145 0.001 0.09 alexnet</span></code></pre></div>
<p>If you take a look at the metrics and the corresponding hyperparameter values,
you'll see which direction you should try next with your values. That's one way
we can use DVC to fine-tune AlexNet for this particular dataset.</p>
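<p>One way to read the table is to compare only the last checkpoint of each experiment. The sketch below copies those three rows from the table above and ranks them by validation accuracy, breaking ties with validation loss. The <code>rank</code> helper and the choice of ranking criteria are assumptions made for illustration, not something DVC prescribes.</p>

```python
# Final checkpoint of each AlexNet experiment, copied from the table above.
experiments = [
    {"name": "exp-a1f53", "lr": 0.001, "momentum": 0.09,
     "val_acc": 0.9085, "val_loss": 0.25145},
    {"name": "exp-c0b11", "lr": 0.009, "momentum": 0.017,
     "val_acc": 0.82353, "val_loss": 0.69077},
    {"name": "exp-52406", "lr": 0.025, "momentum": 0.5,
     "val_acc": 0.9085, "val_loss": 1.7642},
]


def rank(rows):
    """Sort by highest validation accuracy, then by lowest validation loss."""
    return sorted(rows, key=lambda row: (-row["val_acc"], row["val_loss"]))


ranked = rank(experiments)
for row in ranked:
    print(row["name"], row["lr"], row["momentum"], row["val_acc"], row["val_loss"])
```

<p>With these numbers, the two experiments that tie on validation accuracy are separated by their validation loss, which is a reminder to look at more than one metric before picking the next values to try.</p>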
<h2 id="fine-tuning-squeezenet" style="position:relative;">Fine-tuning SqueezeNet<a href="#fine-tuning-squeezenet" aria-label="fine tuning squeezenet permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We'll switch over to fine-tuning SqueezeNet now that you've seen how the process
works in DVC. You'll need to update the <code>model_name</code> hyperparameter in
<code>params.yaml</code> to <code>squeezenet</code> if you're following along. The other
hyperparameter values can stay the same for now.</p>
<p>This is a good time to note that DVC tracks more than the hyperparameter
changes for each experiment: it also tracks any code and dataset changes.</p>
<p>Let's run one experiment with <a href="https://dvc.org/doc/command-reference/exp/run#--reset"><code>dvc exp run --reset</code></a> to show the difference
in the metrics between the two models. Remember that with checkpoints, a new run
continues training on top of the previous experiment, so we use the
<code>--reset</code> option to start a fresh experiment for the new
model. You should see results similar to this in your table.</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>training_time<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>model_name<span class="token hide">**</span></span></span>
</span> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>1<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.85656<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.35667<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>83.414<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.87582<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.34273<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.025<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.5<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>squeezenet<span class="token hide">**</span></span>
<span class="token bold"><span class="token hide">**</span>main<span class="token hide">**</span></span> - - - - - - <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>squeezenet<span class="token hide">**</span></span>
│ ╓ 87ccd2e [exp-95f0f] 1 0.85656 0.35667 83.414 0.87582 0.34273 0.025 0.5 squeezenet
├─╨ 7d2fafc 0 0.80328 0.50723 29.165 0.89542 0.3987 0.025 0.5 squeezenet
│ ╓ 54e87bc [exp-52406] 11 0.88525 1.1355 76.799 0.9085 1.7642 0.025 0.5 alexnet
│ ╟ b2b9ad0 (2361cff) 10 0.79098 2.9427 25.715 0.8366 1.4148 0.025 0.5 alexnet
│ ╓ 2361cff [exp-c0b11] 9 0.91803 0.27989 228.59 0.82353 0.69077 0.009 0.017 alexnet</span></code></pre></div>
<p>The newest experiment's metrics differ noticeably from the last AlexNet run,
even though both used the same hyperparameter values. That tells us that the
values that worked well for AlexNet might not be the best for SqueezeNet.</p>
<p>So we'll need to run a few experiments to find the best hyperparameter values.
This time, we'll take advantage of queues in DVC to set up the experiments first
and then run them all with a single command. To queue an experiment, we'll run this command.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--queue</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">lr</span><span class="token operator">=</span><span class="token number">0.0001</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">momentum</span><span class="token operator">=</span><span class="token number">0.9</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">num_epochs</span><span class="token operator">=</span><span class="token number">2</span></span></code></pre></div>
<p>Running this sets up an experiment for future execution, so we'll go ahead and run
this command one more time with different values.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--queue</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">lr</span><span class="token operator">=</span><span class="token number">0.001</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">momentum</span><span class="token operator">=</span><span class="token number">0.09</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">num_epochs</span><span class="token operator">=</span><span class="token number">2</span></span></code></pre></div>
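<p>If you want to queue more than a couple of combinations, typing each command by hand gets tedious. The sketch below builds the same two commands shown above from a list of value pairs, using only the <code>--queue</code> and <code>-S</code> options from this article. The <code>queue_command</code> helper is hypothetical, and the script only prints the commands rather than running them.</p>

```python
import shlex


def queue_command(lr, momentum, num_epochs):
    """Build the argument list for one queued experiment."""
    return [
        "dvc", "exp", "run", "--queue",
        "-S", f"lr={lr}",
        "-S", f"momentum={momentum}",
        "-S", f"num_epochs={num_epochs}",
    ]


# The two (lr, momentum) pairs queued above, kept as strings so they are
# passed through exactly as written.
settings = [("0.0001", "0.9"), ("0.001", "0.09")]
commands = [queue_command(lr, momentum, "2") for lr, momentum in settings]
for command in commands:
    # Printed for review; each list could be handed to subprocess.run instead.
    print(shlex.join(command))
```

<p>Keeping each command as a list of arguments avoids shell quoting problems if you later decide to execute them from Python.</p>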
<p>You can check the details of the experiments you have queued by looking at the
experiments table with <a href="https://dvc.org/doc/command-reference/exp/show"><code>dvc exp show</code></a>. You'll see something like this.</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>training_time<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>model_name<span class="token hide">**</span></span></span>
</span> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>1<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.85656<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.35667<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>83.414<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.87582<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.34273<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.025<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.5<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>squeezenet<span class="token hide">**</span></span>
<span class="token bold"><span class="token hide">**</span>main<span class="token hide">**</span></span> - - - - - - <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>squeezenet<span class="token hide">**</span></span>
│ ╓ 87ccd2e [exp-95f0f] 1 0.85656 0.35667 83.414 0.87582 0.34273 0.025 0.5 squeezenet
├─╨ 7d2fafc 0 0.80328 0.50723 29.165 0.89542 0.3987 0.025 0.5 squeezenet
│ ╓ 54e87bc [exp-52406] 11 0.88525 1.1355 76.799 0.9085 1.7642 0.025 0.5 alexnet
│ ╟ b2b9ad0 (2361cff) 10 0.79098 2.9427 25.715 0.8366 1.4148 0.025 0.5 alexnet
│ ╓ 2361cff [exp-c0b11] 9 0.91803 0.27989 228.59 0.82353 0.69077 0.009 0.017 alexnet
│ ╟ 7686d2f 8 0.90984 0.23496 177.65 0.87582 0.50887 0.009 0.017 alexnet
│ ╟ 671f8cd 7 0.88934 0.39237 126.7 0.86928 0.47856 0.009 0.017 alexnet
│ ╟ ea1bf61 6 0.84836 0.4195 75.834 0.91503 0.30885 0.009 0.017 alexnet
...
├── *2df7fa5 - - - - - - 0.0001 0.9 squeezenet
├── *699dcae - - - - - - 0.001 0.09 squeezenet
</span> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div>
<p>Then you can run all of the queued experiments with this command.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--run-all</span></span></code></pre></div>
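<p>If you have many parameter combinations to queue before running that command,
you could generate the queueing commands with a small script. The following is
only a sketch: it assumes the parameters are exposed under the names
<code>lr</code> and <code>momentum</code> (taken from the table columns), and
the grid values and the <code>build_queue_commands</code> helper are
hypothetical. It only prints the commands rather than running them.</p>

```python
import shlex
from itertools import product


def build_queue_commands(grid):
    """Return one `dvc exp run --queue` argument list per combination in `grid`."""
    names = list(grid)
    commands = []
    for values in product(*(grid[name] for name in names)):
        cmd = ["dvc", "exp", "run", "--queue"]
        for name, value in zip(names, values):
            # --set-param overrides a single parameter for the queued experiment.
            cmd += ["--set-param", f"{name}={value}"]
        commands.append(cmd)
    return commands


# Hypothetical grid; the parameter names are assumed to match your params file.
grid = {"lr": [0.001, 0.0001], "momentum": [0.09, 0.9]}
commands = build_queue_commands(grid)

for cmd in commands:
    # Dry run: print each command. To actually queue it, you could pass `cmd`
    # to subprocess.run(cmd, check=True) from inside the DVC repository.
    print(shlex.join(cmd))
```

<p>Each printed line queues one experiment; the queued experiments are then
executed together by the run-all command above.</p>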
<p>Now if you take a look at your table, you'll see the metrics from those 3
experiments.</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>training_time<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>model_name<span class="token hide">**</span></span></span>
</span> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>5<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.76639<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.49865<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>85.705<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.81699<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.4518<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>squeezenet<span class="token hide">**</span></span>
<span class="token bold"><span class="token hide">**</span>main<span class="token hide">**</span></span> - - - - - - <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>squeezenet<span class="token hide">**</span></span>
│ ╓ 699dcae [exp-8322f] 5 0.76639 0.49865 85.705 0.81699 0.4518 0.001 0.09 squeezenet
│ ╟ d26c25b (2df7fa5) 4 0.60246 0.68464 29.243 0.69935 0.55156 0.001 0.09 squeezenet
│ ╓ 2df7fa5 [exp-d1c65] 3 0.78689 0.488 83.929 0.83007 0.41527 0.0001 0.9 squeezenet
│ ╟ 05e1b41 (87ccd2e) 2 0.59016 0.76999 28.455 0.75163 0.49807 0.0001 0.9 squeezenet
│ ╓ 87ccd2e [exp-95f0f] 1 0.85656 0.35667 83.414 0.87582 0.34273 0.025 0.5 squeezenet
├─╨ 7d2fafc 0 0.80328 0.50723 29.165 0.89542 0.3987 0.025 0.5 squeezenet
│ ╓ 54e87bc [exp-52406] 11 0.88525 1.1355 76.799 0.9085 1.7642 0.025 0.5 alexnet
│ ╟ b2b9ad0 (2361cff) 10 0.79098 2.9427 25.715 0.8366 1.4148 0.025 0.5 alexnet
│ ╓ 2361cff [exp-c0b11] 9 0.91803 0.27989 228.59 0.82353 0.69077 0.009 0.017 alexnet
│ ╟ 7686d2f 8 0.90984 0.23496 177.65 0.87582 0.50887 0.009 0.017 alexnet</span></code></pre></div>
<p>Then you'll be able to decide which direction to take your fine-tuning
efforts and which model works best for your project. In this
case, it seems like SqueezeNet might be the winner!</p>
<p>You can take all of the DVC setup and apply this to your own custom fine-tuning
use case.</p>
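<p>To make that comparison explicit, here is a minimal sketch that ranks the
final checkpoint of each experiment shown in the table. The rows are copied from
the table; the choice of lowest <code>val_loss</code> as the deciding metric and
the <code>rank_by</code> helper are assumptions made for illustration. Ranking
by <code>val_acc</code> alone would instead favor the AlexNet run
<code>exp-52406</code>, whose validation loss is much higher.</p>

```python
# Final-step rows copied from the experiments table above.
experiments = [
    {"name": "exp-8322f", "model": "squeezenet", "val_acc": 0.81699, "val_loss": 0.4518},
    {"name": "exp-d1c65", "model": "squeezenet", "val_acc": 0.83007, "val_loss": 0.41527},
    {"name": "exp-95f0f", "model": "squeezenet", "val_acc": 0.87582, "val_loss": 0.34273},
    {"name": "exp-52406", "model": "alexnet", "val_acc": 0.9085, "val_loss": 1.7642},
    {"name": "exp-c0b11", "model": "alexnet", "val_acc": 0.82353, "val_loss": 0.69077},
]


def rank_by(rows, metric, lower_is_better=True):
    """Return the rows sorted from best to worst on the given metric."""
    return sorted(rows, key=lambda row: row[metric], reverse=not lower_is_better)


# Assumption: the lowest validation loss is the deciding criterion.
best = rank_by(experiments, "val_loss")[0]
print(best["name"], best["model"], best["val_loss"])
```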
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>When you're working with pretrained models, it can be hard to fine-tune them to
give you the results you need. You might end up replacing the last layer of the
model to fit your problem or you might need to dig deeper. Then you have to
consider updating the hyperparameter values until you get the best model you
can.</p>
<p>That's why it's important to research tools that make this process more
efficient. Using DVC to help with this kind of experimentation will give you the
ability to reproduce any experiment you run, making it easier to collaborate
with others on a project. It will also help you keep track of what you've
already tried in previous experiments.</p>https://dvc.org/blog/august-21-dvc-heartbeathttps://dvc.org/blog/august-21-dvc-heartbeatTue, 17 Aug 2021 00:00:00 GMT<h1 id="its-all-about-that-data" style="position:relative;">It's all about that Data!<a href="#its-all-about-that-data" aria-label="its all about that data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><img src="https://media.giphy.com/media/4FQMuOKR6zQRO/giphy.gif" alt="Data! Data! Data!"></p>
<h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>This month we are seeing the progression of a couple of pieces from the
<a href="https://media.giphy.com/media/62HBhssMOgdJUZQp1X/giphy.gif" target="_blank" rel="nofollow noopener noreferrer">June Heartbeat</a> as
well as checking out a use case, a tool stack, and some great tutorials from our
Community members.</p>
<h2 id="lj-miranda-synthesizes-the-mlops-space-once-again" style="position:relative;">LJ Miranda synthesizes the MLOps space once again!<a href="#lj-miranda-synthesizes-the-mlops-space-once-again" aria-label="lj miranda synthesizes the mlops space once again permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://twitter.com/ljvmiranda921" target="_blank" rel="nofollow noopener noreferrer">LJ Miranda</a> has written another amazing article,
following the series on the MLOps tools landscape that we
covered in the June Heartbeat. This time he focuses on the data-centric
wave taking over the space, reviewing the methods, approaches, and
techniques that ensure quality data for ML projects. If the adroit summaries of
complex concepts don't thrill you, the links to no fewer than 63 (😱) resources
will get you on your way to data-centric nirvana.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 662px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/6cfa523455e454fb01e9f7fabb1cf96f/39600/lj-miranda-data-centric.png" alt="Data Centric Framework" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>LJ Miranda's Framework for putting data-centric machine learning into context
<a href="https://ljvmiranda921.github.io/notebook/2021/07/30/data-centric-ml/" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p>
<h2 id="neda-sultovas-comparison-of-dvc-mlflow-and-metaflow" style="position:relative;">Neda Sultova's Comparison of DVC, MLFlow and Metaflow<a href="#neda-sultovas-comparison-of-dvc-mlflow-and-metaflow" aria-label="neda sultovas comparison of dvc mlflow and metaflow permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Also covered in the June Heartbeat was
<a href="https://www.linkedin.com/in/neda-sultova-597a811a8/" target="_blank" rel="nofollow noopener noreferrer">Neda Sultova's</a> piece on
the rubric she is using to decide which MLOps tools to use for the teams
at <a href="https://www.helmholtz.ai/" target="_blank" rel="nofollow noopener noreferrer">Helmholtz AI</a>. This
<a href="https://medium.com/geekculture/comparing-metaflow-mlflow-and-dvc-e84be6db2e2" target="_blank" rel="nofollow noopener noreferrer">next article</a>
reviews her research into DVC, MLFlow and Metaflow and offers a thorough
analysis of the tools across multiple dimensions. Beyond the article, check out
her <a href="https://github.com/hzdr/mlops_comparison" target="_blank" rel="nofollow noopener noreferrer">MLOps Comparison repository</a> as
well as her
<a href="https://github.com/hzdr/mlops_comparison/blob/master/Content/Comparison_table.pdf" target="_blank" rel="nofollow noopener noreferrer">Comparison Table</a>.
They will not disappoint!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 454px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/24f89675d24a0316c700db40eee9b0f2/39600/neda-sultova-2.png" alt="Machine Learning Lifecycle" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Machine Learning Lifecycle
<a href="https://medium.com/geekculture/comparing-metaflow-mlflow-and-dvc-e84be6db2e2" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p>
<h2 id="amit-kulkarnis-tutorials" style="position:relative;">Amit Kulkarni's Tutorials<a href="#amit-kulkarnis-tutorials" aria-label="amit kulkarnis tutorials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Writing for the
<a href="https://datahack.analyticsvidhya.com/contest/data-science-blogathon-9/#LeaderBoard" target="_blank" rel="nofollow noopener noreferrer">Analytics Vidhya Data Science Blogathon,</a>
<a href="https://www.linkedin.com/in/amitvkulkarni2/" target="_blank" rel="nofollow noopener noreferrer">Amit Kulkarni</a> created two
tutorials on DVC.
<a href="https://www.analyticsvidhya.com/blog/2021/06/mlops-tracking-ml-experiments-with-data-version-control/?utm_source=dlvr.it&utm_medium=twitter" target="_blank" rel="nofollow noopener noreferrer">Tracking ML Experiments with Data Version Control</a>
reviews DVC and takes you through getting started, setup, fetching data and
pre-processing, and the steps of an ML project. Next it sets up DVC, the
pipeline, and shows how to run model metrics and plots. In
<a href="https://www.analyticsvidhya.com/blog/2021/06/mlops-versioning-datasets-with-git-dvc/" target="_blank" rel="nofollow noopener noreferrer">MLOps| Versioning with Git & DVC,</a>
Amit continues with an explanation of how data and model versioning works with
GitHub paired with DVC.</p>
<p>In a previous article entitled
<a href="https://www.analyticsvidhya.com/blog/2021/04/bring-devops-to-data-science-with-continuous-mlops/" target="_blank" rel="nofollow noopener noreferrer">Bring DevOps to Data Science with MLOps</a>,
Amit walks through a tutorial using CML to bring CI/CD functionality to your ML
project and automate the process. All great posts to check out!👇🏼</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.analyticsvidhya.com/blog/2021/06/mlops-tracking-ml-experiments-with-data-version-control/?utm_source=dlvr.it&utm_medium=twitter" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Tracking ML Experiments With Data Version Control</h4>
<div class="elp-description">Amit Kulkarni's tutorial on getting started with DVC and tracking experiments</div>
<div class="elp-link">https://analyticsvidhya.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-08-17/a-v-8059e54b05396a5537a69588b79d36c7.png" alt="Tracking ML Experiments With Data Version Control">
</div>
</a>
</section>
<section class="elp-content-holder">
<a href="https://www.analyticsvidhya.com/blog/2021/06/mlops-versioning-datasets-with-git-dvc/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">MLOps | Versioning Datasets with Git & DVC</h4>
<div class="elp-description">Amit Kulkarni's tutorial on how DVC works with Git to version your datasets.</div>
<div class="elp-link">https://analyticsvidhya.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-08-17/a-v-8059e54b05396a5537a69588b79d36c7.png" alt="MLOps | Versioning Datasets with Git & DVC">
</div>
</a>
</section>
<section class="elp-content-holder">
<a href="https://www.analyticsvidhya.com/blog/2021/04/bring-devops-to-data-science-with-continuous-mlops/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Bring DevOps To Data Science With MLOps</h4>
<div class="elp-description">Amit Kulkarni's tutorial on how to use CML to bring the CI/CD functionality of DevOps to your data science projects.</div>
<div class="elp-link">https://analyticsvidhya.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-08-17/a-v-8059e54b05396a5537a69588b79d36c7.png" alt="Bring DevOps To Data Science With MLOps">
</div>
</a>
</section>
<p></p>
<h2 id="andreas-malekos-mlops-tool-stack-at-continuum-industries" style="position:relative;">Andreas Malekos' MLOps Tool Stack at Continuum Industries<a href="#andreas-malekos-mlops-tool-stack-at-continuum-industries" aria-label="andreas malekos mlops tool stack at continuum industries permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Last but not least, we bring you a great article from
<a href="https://www.linkedin.com/in/andreasmalekos/" target="_blank" rel="nofollow noopener noreferrer">Andreas Malekos</a>, Chief Scientist
at <a href="https://www.continuum.industries/" target="_blank" rel="nofollow noopener noreferrer">Continuum Industries</a>. In
<a href="https://neptune.ai/blog/mlops-tool-stack-continuum-industries" target="_blank" rel="nofollow noopener noreferrer">the post</a> he
outlines the tool stack and MLOps platform they use to do their work automating
and optimizing the design of linear infrastructure assets like water pipelines,
overhead transmission lines, subsea power lines, or telecommunication cables.</p>
<p>Amongst their tool stack are DVC and CML, and the article outlines what they
like (!🙈Spoiler alert🙊! DVC making repeatability achievable) and the things
that they don't like that still need to be improved.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 670.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ec7c9901d7fcae60af4221b7fc2796d2/39600/continuum-tool-stack.png" alt="Continuum Industries MLOps Tool Stack" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Continuum Industries MLOps Tool Stack
<a href="https://neptune.ai/wp-content../uploads/Continuum-Industries-tool-stack-final.png" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p>
<h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Though the team has been taking some vacation time in the last month, there's
still a lot going on!</p>
<p><img src="https://media.giphy.com/media/aNqEFrYVnsS52/giphy.gif" alt="Typing Cat"></p>
<h2 id="docs-updates" style="position:relative;">Docs Updates<a href="#docs-updates" aria-label="docs updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This month we are introducing docs updates so that you will always be aware of
what has changed as our open source projects mature.</p>
<p>Our docs team, made up of
<a href="https://www.linkedin.com/in/jorgeorpinel/" target="_blank" rel="nofollow noopener noreferrer">Jorge Orpinel</a>,
<a href="https://emresahin.net" target="_blank" rel="nofollow noopener noreferrer">Emre Şahin</a>, <a href="https://cdcl.ml" target="_blank" rel="nofollow noopener noreferrer">Casper da Costa-Luis</a>,
and
<a href="https://www.linkedin.com/in/david-de-la-iglesia-castro-b4b67b20a/" target="_blank" rel="nofollow noopener noreferrer">David de la Iglesia-Castro</a>,
has been hard at work updating our docs to make sure you have what you need to
be successful using our tools! Updates include:</p>
<ul>
<li>Complete <a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive docs</a></li>
<li>We have a new <a href="https://dvc.org/doc/user-guide/glossary" target="_blank" rel="nofollow noopener noreferrer">Glossary page</a> and a
first Basic Concepts page
(<a href="https://dvc.org/doc/user-guide/basic-concepts/workspace" target="_blank" rel="nofollow noopener noreferrer"><em>DVC Workspace</em></a>)</li>
<li><a href="https://cml.dev/doc" target="_blank" rel="nofollow noopener noreferrer">CML Docs migration to CML.Dev</a></li>
<li><a href="https://dvc.org/doc/start" target="_blank" rel="nofollow noopener noreferrer">Added Videos to Get Started: Metrics and Experiments pages</a>
and
<a href="https://dvc.org/doc/user-guide/experiment-management/checkpoints" target="_blank" rel="nofollow noopener noreferrer">Checkpoints Guide</a></li>
<li>Authentication examples for
<a href="https://dvc.org/doc/command-reference/remote/modify#example-some-azure-authentication-methods" target="_blank" rel="nofollow noopener noreferrer">Azure Blob remote storage</a>
from Community member @meierale ❤️</li>
</ul>
<h2 id="batuhan-taskayas-refactor-project-hits-first-page-in-hackernews" style="position:relative;">Batuhan Taskaya's Refactor Project hits First Page in HackerNews!<a href="#batuhan-taskayas-refactor-project-hits-first-page-in-hackernews" aria-label="batuhan taskayas refactor project hits first page in hackernews permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>A <a href="https://github.com/isidentical/refactor" target="_blank" rel="nofollow noopener noreferrer">Refactor Project</a> created by team
member <a href="https://twitter.com/isidentical" target="_blank" rel="nofollow noopener noreferrer">Batuhan Taskaya</a> (AKA @isidentical)
was shared on HackerNews and made it to the main page! You can
<a href="https://news.ycombinator.com/item?id=28027016" target="_blank" rel="nofollow noopener noreferrer">catch all the comments here</a>!</p>
<p>Explanation of the project:</p>
<blockquote>
<p>refactor is an end-to-end refactoring framework that is built on top of the
'simple but effective refactorings' assumption. It is much easier to write a
simple script with it rather than trying to figure out what sort of a regex
you need in order to replace a pattern (if it is even matchable with regexes).</p>
</blockquote>
<blockquote>
<p>Every refactoring rule offers a single entrypoint, match(), where they accept
an AST node (from the ast module in the standard library) and respond with
either returning an action to refactor or nothing. If the rule succeeds on the
input, then the returned action will build a replacement node and refactor
will simply replace the code segment that belong to the input with the new
version.</p>
</blockquote>
<p>Way to go Batuhan! 🚀</p>
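<p>To illustrate the idea described in the quote, here is a toy sketch built only
on the standard library's <code>ast</code> module. It is not the refactor
library's API: the <code>ReplacePrintWithLog</code> rule, the
<code>apply_rule</code> driver, and the <code>log</code> function name are all
hypothetical, and the driver assumes ASCII source and Python 3.9+ (for
<code>ast.unparse</code>).</p>

```python
import ast


class ReplacePrintWithLog:
    """Toy rule: match() returns a replacement node, or None if the rule does not apply."""

    def match(self, node):
        is_print_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "print"
        )
        if not is_print_call:
            return None
        # Build the replacement node: the same call, but to a hypothetical log().
        return ast.Call(
            func=ast.Name(id="log", ctx=ast.Load()),
            args=node.args,
            keywords=node.keywords,
        )


def apply_rule(source, rule):
    """Replace the source segment of the first matching node with the new version."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    for node in ast.walk(tree):
        replacement = rule.match(node)
        if replacement is None:
            continue
        # ast column offsets are UTF-8 byte offsets, so this assumes ASCII source.
        start = sum(len(line) for line in lines[: node.lineno - 1]) + node.col_offset
        end = sum(len(line) for line in lines[: node.end_lineno - 1]) + node.end_col_offset
        return source[:start] + ast.unparse(replacement) + source[end:]
    return source  # no node matched, so the source is returned unchanged


print(apply_rule("x = 1\nprint(x)\n", ReplacePrintWithLog()))
```

<p>The rule only inspects a node and describes the replacement; splicing the new
code into the original text is left to the driver, which mirrors the division
of labor the quote describes.</p>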
<h2 id="july-office-hour-meetup" style="position:relative;">July Office Hour Meetup<a href="#july-office-hour-meetup" aria-label="july office hour meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>If you missed our July Office Hours, good news! It's now available on our
<a href="https://www.youtube.com/channel/UC37rp97Go-xIX3aNFVHhXfQ" target="_blank" rel="nofollow noopener noreferrer">YouTube Channel</a>, where
you can watch <a href="https://twitter.com/jcpsantiago" target="_blank" rel="nofollow noopener noreferrer">João Santiago</a> talk about
{dvthis} and how his team at <a href="https://www.billie.io/" target="_blank" rel="nofollow noopener noreferrer">Billie.io</a> uses DVC to
productionize rstats.</p>
<p>Also in the Meetup is a DVC Studio demo by
<a href="https://www.linkedin.com/in/tapa-dipti-sitaula/" target="_blank" rel="nofollow noopener noreferrer">Tapa Dipti Sitaula</a>, Senior
Product Engineer for Studio. You can catch the presentations along with great
questions and discussion from the Community!</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/H22j1lWIvMw?rel=0&%3B=&%3Bshowinfo=0%3B&start=1546" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>So remember when I told you last month about DVC + Streamlit = ❤️ ? Well at our
August Office Hours Meetup,
<a href="https://www.linkedin.com/in/antoine-toubhans-92262119/" target="_blank" rel="nofollow noopener noreferrer">Antoine Toubhans</a> of
<a href="https://www.sicara.fr/" target="_blank" rel="nofollow noopener noreferrer">Sicara</a> will be presenting
<a href="https://www.sicara.ai/blog/dvc-streamlit-webui-ml" target="_blank" rel="nofollow noopener noreferrer">his tutorial</a> on how to do
just that! Join us in the integrating fun on August 19th at 3:00 pm UTC! RSVP at
the link below! 👇🏼</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/279723437/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">DVC Office Hours - DVC and Streamlit Integration</h4>
<div class="elp-description">Antoine Toubhans of Sicara shares his tutorial for using Streamlit with DVC to create a customizable web UI</div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-08-17/streamlit-oh-0f211180ca12528deb0318d283d7886d.png" alt="DVC Office Hours - DVC and Streamlit Integration">
</div>
</a>
</section>
<p></p>
<h2 id="learning-opportunities" style="position:relative;">Learning Opportunities<a href="#learning-opportunities" aria-label="learning opportunities permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This week's DVC Learn Meetup (August 18th) will be the last in our series of DVC
Learn Meetups designed to get teams up and running with DVC. We will digest our
learnings from this first cohort and revamp for the next set of three classes
that will begin in September. Subscribe to
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/" target="_blank" rel="nofollow noopener noreferrer">our Meetup group</a> and
follow us on <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> and
<a href="https://www.linkedin.com/company/18657719" target="_blank" rel="nofollow noopener noreferrer">LinkedIn</a> to stay in the know about
all of our upcoming events!</p>
<p>If you are interested in weighing in on what kinds of educational content you
would like to see from us, we'd be grateful if you'd fill out
<a href="https://docs.google.com/forms/d/e/1FAIpQLSdmwjs0ZkxDdODfZTvSwP2bVW4JAVVdxiYhQPyW5dSbsZC8qg/viewform?pli=1" target="_blank" rel="nofollow noopener noreferrer"><strong>this survey</strong></a>
to help us plan! 🙏🏼</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 676.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/83fdf48f7311c67c558afe07fa5a639b/39600/survey.png" alt="DVC Online Course survey" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Help us
plan our Online Course! 🙏🏼
<a href="https://docs.google.com/forms/d/e/1FAIpQLSdmwjs0ZkxDdODfZTvSwP2bVW4JAVVdxiYhQPyW5dSbsZC8qg/viewform?pli=1" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Looking for a great opportunity at an amazing company? Check out our open
positions
<a href="https://www.notion.so/iterative/iterative-ai-is-hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">at this link</a>
to find details of all the positions including:</p>
<ul>
<li>Senior Front-End Engineer (TypeScript, Node, React)</li>
<li>Senior Software Engineer (ML, Dev Tools, Python)</li>
<li>Senior Software Engineer (ML, Data Infra, GoLang)</li>
<li>Machine Learning Engineer/Field Data Scientist</li>
<li>Developer Advocate (ML)</li>
<li>Director/VP of Engineering (ML, DevTools)</li>
<li>Director/VP of Product (ML, Data Infra, SaaS)</li>
<li>Director/VP of Operations/Chief of Staff</li>
</ul>
<p>Please pass this info on to anyone you know that may fit the bill. We look
forward to new team members! 🎉</p>
<hr>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/july-21-dvc-community-gemshttps://dvc.org/blog/july-21-dvc-community-gemsTue, 27 Jul 2021 00:00:00 GMT<h3 id="q-im-trying-to-use-the---reuse-option-of-cml-runner-if-i-launch-2-cml-experiments-in-parallel-will-cml-use-the-same-runner-or-spin-up-another-one-if-the-existing-one-is-in-use" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/850340190434492445" target="_blank" rel="nofollow noopener noreferrer">Q: I'm trying to use the <code>--reuse</code> option of <code>cml runner</code>. If I launch 2 CML experiments in parallel, will CML use the same runner or spin up another one if the existing one is in use?</a><a href="#q-im-trying-to-use-the---reuse-option-of-cml-runner-if-i-launch-2-cml-experiments-in-parallel-will-cml-use-the-same-runner-or-spin-up-another-one-if-the-existing-one-is-in-use" aria-label="q im trying to use the reuse option of cml runner if i launch 2 cml experiments in parallel will cml use the same runner or spin up another one if the existing one is in use permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If you don't reuse the runner and you have set up a deploy job, that deploy job
will launch two cloud runners. With <code>--reuse</code> it will check if the runner with
that tag exists and will not launch another one. Every runner will be listening
for incoming jobs until the max idle time is reached.</p>
<p>Let's say that you set up one runner with <code>--reuse</code> and launch multiple jobs.
In that case only one runner should be launched, and it will take all the
jobs.</p>
<p>A runner deployed by a workflow is not tied specifically to the train job
launched in that same workflow. You just add runners to the pool, and they wait
for jobs until the idle timeout expires.</p>
<p>We're working on something like <code>--reuse-idle</code> that would be easy to implement.
The idea would be to reuse only idle runners, so that if your job fails and the
fix is pretty fast, you don't need to spin up another runner. You can track our
progress on that through
<a href="https://github.com/iterative/cml/issues/575" target="_blank" rel="nofollow noopener noreferrer">this GitHub issue</a>.</p>
<p>A great question from @Corentin in the Discord community!</p>
<h3 id="q-how-can-i-run-self-hosted-runners-on-an-on-premise-machine-indefinitely" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/851923384613994496" target="_blank" rel="nofollow noopener noreferrer">Q: How can I run self-hosted runners on an on-premise machine indefinitely?</a><a href="#q-how-can-i-run-self-hosted-runners-on-an-on-premise-machine-indefinitely" aria-label="q how can i run self hosted runners on an on premise machine indefinitely permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You can achieve this by passing the <code>--idle-timeout=0</code> option to <code>cml runner</code> in
order to disable the timeout.</p>
<p>Thanks @achbogga!</p>
<h3 id="q-how-can-i-change-the-default-vpc-to-a-different-one-with-cml-runner-for-aws" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/857940793616498738" target="_blank" rel="nofollow noopener noreferrer">Q: How can I change the default VPC to a different one with <code>cml-runner</code> for AWS?</a><a href="#q-how-can-i-change-the-default-vpc-to-a-different-one-with-cml-runner-for-aws" aria-label="q how can i change the default vpc to a different one with cml runner for aws permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Great gem from @krish98409!</p>
<p>You can set the security group via <code>cloud-aws-security-group</code>. It will
pick the VPC that manages that precise security group.</p>
<p>We still don't provide a way of specifying VPCs other than the default one, but
it's an issue that we're currently working on:
<a href="https://github.com/iterative/terraform-provider-iterative/issues/107" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/terraform-provider-iterative/issues/107</a></p>
<h3 id="q-is-it-possible-to-rename-and-modify-a-file-inside-a-directory-tracked-by-dvc-in-one-commitchange" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/849589484517588992" target="_blank" rel="nofollow noopener noreferrer">Q: Is it possible to rename and modify a file inside a directory tracked by DVC in one commit/change?</a><a href="#q-is-it-possible-to-rename-and-modify-a-file-inside-a-directory-tracked-by-dvc-in-one-commitchange" aria-label="q is it possible to rename and modify a file inside a directory tracked by dvc in one commitchange permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If you rename the file and modify it, you just need to run <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a>
and then commit the change into Git.</p>
<p>This was a good question for everyone. Thanks @snowpong!</p>
<h3 id="q-how-can-i-list-the-experiments-ive-queued" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/856882434138570753" target="_blank" rel="nofollow noopener noreferrer">Q: How can I list the experiments I've queued?</a><a href="#q-how-can-i-list-the-experiments-ive-queued" aria-label="q how can i list the experiments ive queued permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is a great question that helps us all understand queued experiments, so
thanks @adwivedi.</p>
<p>To look at your queued experiments, run <a href="https://dvc.org/doc/command-reference/exp/show"><code>dvc exp show</code></a>. All of the queued
experiments will be marked with an asterisk <code>*</code>.</p>
<p><em>Queued experiments are not shown with the <a href="https://dvc.org/doc/command-reference/exp/list"><code>dvc exp list</code></a> command at the
moment.</em></p>
<h3 id="q-i-have-two-machines-and-a-central-remote-with-my-second-machine-i-want-to-pull-the-dataset-from-the-first-machine-how-can-i-pull-the-data-with-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/859034882297823233" target="_blank" rel="nofollow noopener noreferrer">Q: I have two machines and a central remote. With my second machine, I want to pull the dataset from the first machine. How can I pull the data with DVC?</a><a href="#q-i-have-two-machines-and-a-central-remote-with-my-second-machine-i-want-to-pull-the-dataset-from-the-first-machine-how-can-i-pull-the-data-with-dvc" aria-label="q i have two machines and a central remote with my second machine i want to pull the dataset from the first machine how can i pull the data with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Make sure that you have configured a DVC remote and run <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> from your
first machine. You should be able to find the files on the remote storage where
you pushed them to after running that command. Then you can run <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> on
your second machine and this should give you the dataset you pushed from the
first machine.</p>
<p>You will run into some issues if your remote isn't configured properly on the
second machine. Check your <code>.dvc/config</code> file for the second machine to make
sure there aren't any errors. It could be something as simple as a connection
string without the necessary quotation marks!</p>
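<p>As a rough aid for that check, the sketch below parses the text of a
<code>.dvc/config</code> file with Python's standard <code>configparser</code> and reports
missing pieces. It assumes the usual layout: a <code>[core]</code> section naming the
default remote and a <code>['remote "name"']</code> section holding its <code>url</code>. The
remote name <code>storage</code>, the bucket URL, and the helper
<code>find_remote_problems</code> are made up for illustration, and this is not how DVC
itself validates its config.</p>

```python
import configparser

# Hypothetical config contents; the remote name and URL are placeholders.
SAMPLE_CONFIG = """\
[core]
    remote = storage
['remote "storage"']
    url = s3://example-bucket/dvc-store
"""


def find_remote_problems(config_text):
    """Return a list of human-readable problems found in the config text."""
    # Disable interpolation so '%' characters in URLs are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)

    default_remote = parser.get("core", "remote", fallback=None)
    if default_remote is None:
        return ["no default remote is set in the [core] section"]

    # Assumed section naming: ['remote "<name>"'] with the quotes kept.
    section = f"'remote \"{default_remote}\"'"
    if not parser.has_section(section):
        return [f"default remote '{default_remote}' has no matching section"]
    if not parser.get(section, "url", fallback="").strip():
        return [f"remote '{default_remote}' has no url"]
    return []


print(find_remote_problems(SAMPLE_CONFIG))  # → []
```

<p>To check a real project, read the file on the second machine and pass its
text to the function. An empty list only means these basic fields are present,
not that the remote is reachable.</p>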
<p>Thanks so much for this question @raharth!</p>
<h3 id="q-dvc-push-says-everything-is-up-to-date-however-i-modified-my-dataset-and-this-is-confirmed-with-dvc-status-where-it-lists-a-modified-entry-on-the-changed-outs-how-can-i-force-a-push-of-my-changes" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/857931383476977695" target="_blank" rel="nofollow noopener noreferrer">Q: <code>dvc push</code> says, "Everything is up to date." However, I modified my dataset and this is confirmed with <code>dvc status</code>, where it lists a "modified" entry on the changed outs. How can I force a push of my changes?</a><a href="#q-dvc-push-says-everything-is-up-to-date-however-i-modified-my-dataset-and-this-is-confirmed-with-dvc-status-where-it-lists-a-modified-entry-on-the-changed-outs-how-can-i-force-a-push-of-my-changes" aria-label="q dvc push says everything is up to date however i modified my dataset and this is confirmed with dvc status where it lists a modified entry on the changed outs how can i force a push of my changes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You need to run <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> to commit your changes to the cache.</p>
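<p>If you run these steps from a script, the order matters: commit to the cache
first, then push. Here is a minimal sketch that shells out to the DVC CLI. The
helper names and the optional <code>target</code> argument are our own, and it assumes
<code>dvc</code> is on your <code>PATH</code> and that a remote is already configured.</p>

```python
import subprocess


def commit_then_push_commands(target=None):
    """Build the two CLI calls in the order they need to run."""
    commit = ["dvc", "commit"]
    push = ["dvc", "push"]
    if target is not None:
        # Limit both commands to a single tracked file or directory.
        commit.append(target)
        push.append(target)
    return [commit, push]


def commit_then_push(target=None):
    """Run `dvc commit` and then `dvc push`, stopping at the first failure."""
    for command in commit_then_push_commands(target):
        subprocess.run(command, check=True)


print(commit_then_push_commands())  # → [['dvc', 'commit'], ['dvc', 'push']]
```

<p>Calling <code>commit_then_push()</code> inside a DVC project runs both commands.
<code>dvc commit</code> may ask for confirmation, so this is best run from an
interactive terminal.</p>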
<p>Good question @BSVogler.</p>
<h3 id="q-im-trying-to-use-the-dvc-api-in-a-jupyter-notebook-can-i-simulate-a-dvc-push-command-via-the-api" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/856979475068878898" target="_blank" rel="nofollow noopener noreferrer">Q: I'm trying to use the DVC API in a Jupyter notebook. Can I simulate a <code>dvc push</code> command via the API?</a><a href="#q-im-trying-to-use-the-dvc-api-in-a-jupyter-notebook-can-i-simulate-a-dvc-push-command-via-the-api" aria-label="q im trying to use the dvc api in a jupyter notebook can i simulate a dvc push command via the api permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Nice job working with the Python API @harry134!</p>
<p>You can use the <code>Repo</code> API like this.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>repo <span class="token keyword">import</span> Repo
repo <span class="token operator">=</span> Repo<span class="token punctuation">(</span><span class="token punctuation">)</span>
repo<span class="token punctuation">.</span>push<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div>
<p>The API isn't production ready, so documentation is lacking at the moment.
That said, we use it internally all the time, so you can use it too, with
caution.</p>
<hr>
<p><img src="https://media.giphy.com/media/l0Iyl55kTeh71nTXy/giphy.gif" alt="Done GIF by Quizizz"></p>
<p>At our August Office Hours Meetup, we'll be learning about DVC and Streamlit
integration.
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/279723437/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a>
to stay up to date with specifics as we get closer to the event!</p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get answers for your
DVC and CML questions!</p>https://dvc.org/blog/hyperparam-tuninghttps://dvc.org/blog/hyperparam-tuningMon, 19 Jul 2021 00:00:00 GMT<h2 id="intro" style="position:relative;">Intro<a href="#intro" aria-label="intro permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>When you're starting to build a new machine learning model and you're deciding
on the model architecture, there are a number of issues that arise. You have to
monitor code changes you make, note any differences in the data you've used for
training, and keep up with hyperparameter value updates.</p>
<p>Being able to track all of these changes is important so that you can reproduce
your experiments without wondering which changes gave you the best model. You
can go back to any point in your experimenting process to see which changes gave
you the best results.</p>
<p>In this post, we're going to go through an example of hyperparameter tuning with
reproducibility using DVC. You can add this to any existing project you're
working on or start from a fresh project.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/W48Tvx2p-xE?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="background-on-hyperparameters" style="position:relative;">Background on Hyperparameters<a href="#background-on-hyperparameters" aria-label="background on hyperparameters permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Before jumping straight into training and experiments, let's briefly go over
some background on hyperparameters.
<a href="https://dvc.org/doc/command-reference/params" target="_blank" rel="nofollow noopener noreferrer">Hyperparameters</a> are the values
that define your model. This includes things like the number of layers in a
neural network or the learning rate for gradient descent.</p>
<p>These parameters are different from model parameters because we can't get them
from training our model. They are used to <em>create</em> the model we train with.
Optimizing these values means running training steps for different kinds of
models to see how accurate the results are. We can get the best model from
iterating through different hyperparameter values and seeing how they affect our
accuracy.</p>
<p>That's why we do hyperparameter tuning. There are a couple of common methods
that we'll show code examples for: grid search and random search.</p>
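<p>To make the difference between the two concrete, here is a small sketch in
plain Python. Grid search enumerates every combination in the search space,
while random search samples a fixed number of combinations from it. The
hyperparameter names and values (<code>learning_rate</code>, <code>n_layers</code>) and both
helper functions are illustrative assumptions, not part of the example project.</p>

```python
import itertools
import random

# Hypothetical search space: each hyperparameter maps to its candidate values.
search_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "n_layers": [2, 4, 8],
}


def grid_search(space):
    """Return every combination of candidate values (exhaustive)."""
    names = list(space)
    return [dict(zip(names, combo)) for combo in itertools.product(*space.values())]


def random_search(space, n_trials, seed=0):
    """Return n_trials combinations, each value drawn independently at random."""
    rng = random.Random(seed)  # seeded so the sample is repeatable
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_trials)
    ]


print(len(grid_search(search_space)))       # → 9
print(len(random_search(search_space, 4)))  # → 4
```

<p>Grid search grows multiplicatively with each added hyperparameter, which is
why random search is often used when the space gets large.</p>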
<h2 id="tuning-with-dvc" style="position:relative;">Tuning with DVC<a href="#tuning-with-dvc" aria-label="tuning with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Let's start by talking about DVC a bit because we'll be using it to add
reproducibility to our tuning process. This is the tool we'll be using to track
changes in our data, code, and hyperparameters. With DVC, we can add some
automation to the tuning process and be able to find and restore any really good
models that emerge.</p>
<p>A few things DVC makes easier to do:</p>
<ul>
<li>Letting you make changes without worrying about finding them later</li>
<li>Onboarding other engineers to a project</li>
<li>Sharing experiments with other engineers on different machines</li>
</ul>
<p>For hyperparameter tuning, this means you can play with their values without
losing track of which changes made the best model and also have other engineers
take a look. We'll do an example of this with grid search in DVC first.</p>
<h2 id="working-with-a-dvc-project" style="position:relative;">Working with a DVC project<a href="#working-with-a-dvc-project" aria-label="working with a dvc project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We're going to be working with an existing NLP project. You can
<a href="https://github.com/iterative/example-get-started" target="_blank" rel="nofollow noopener noreferrer">get the code we're working with in this repo</a>.
It already has DVC set up, but you can check out
<a href="https://dvc.org/doc/start" target="_blank" rel="nofollow noopener noreferrer">the Get Started docs</a> if you want to know how the
DVC pipeline was created.</p>
<p>First create a virtual environment with a command similar to this, then activate it.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">python</span> <span class="token parameter variable">-m</span> venv .venv</span></code></pre></div>
<p>After you've cloned the repo, install all of the dependencies with this command.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> <span class="token parameter variable">-r</span> requirements.txt</span></code></pre></div>
<p>You should be able to open your terminal and run an experiment with the
following command.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span></span></code></pre></div>
<p>This will trigger the training process to run and it will record the
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html" target="_blank" rel="nofollow noopener noreferrer">ROC-AUC</a>
of your model. You can check out the results of your experiment with the
following command.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-timestamp</span> <span class="token parameter variable">--include-params</span> train.n_est,train.min_split</span></code></pre></div>
<p><em>We're adding a few options here to make the table view clearer. We aren't
showing timestamps and we're only looking at two hyperparameter values. You can
run <a href="https://dvc.org/doc/command-reference/exp/show"><code>dvc exp show</code></a> without the options to see the entire table.</em></p>
<p>This will produce a table similar to this.</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span>
</span> ──────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.51682<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.93819<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>175<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>64<span class="token hide">**</span></span>
<span class="token bold"><span class="token hide">**</span>master<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.56447<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.94713<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>100<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>64<span class="token hide">**</span></span>
└── a1e8716 [exp-09074] 0.57333 0.94801 100 32
</span> ──────────────────────────────────────────────────────────────────────────────</code></pre></div>
<h3 id="start-tuning-with-grid-search" style="position:relative;">Start tuning with grid search<a href="#start-tuning-with-grid-search" aria-label="start tuning with grid search permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<admon type="tip">
<p>Starting with DVC <code>2.25.0</code>, you can perform a Grid Search directly using
<code>exp run --set-param</code>. See the
<a href="https://dvc.org/doc/command-reference/exp/run#example-grid-search" target="_blank" rel="nofollow noopener noreferrer">example in the command reference</a>.</p>
</admon>
<p>Now that you've seen how to run an experiment, we're going to write a small
script to automate grid search for us using DVC. Using grid search in
hyperparameter tuning means you have an exhaustive list of hyperparameter values
you want to cycle through. Grid search will cover every combination of those
hyperparameter values.</p>
<p>We'll do this by creating queues. A queue is how DVC allows us to create
experiments that won't be run until later. That way we can cycle through
multiple hyperparameters quickly instead of manually updating a config file with
new hyperparameter values for each experiment run. The command syntax for
creating queues looks like this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--queue</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">train.min_split</span><span class="token operator">=</span><span class="token number">8</span></span></code></pre></div>
<p>In the example queue above, we're updating the <code>train.min_split</code> value that's
inside of the <code>params.yaml</code> file. This file holds all of the hyperparameter
values and is where DVC looks to determine if any values have changed. With the
command above, we're automatically updating that value in the <code>params.yaml</code>
using a queued experiment.</p>
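<p>Conceptually, a <code>--set-param</code> assignment such as <code>train.min_split=8</code>
is a dotted path into the nested structure of <code>params.yaml</code> plus a new
value. The sketch below mimics that idea on a plain dictionary. The
<code>set_param</code> helper and the starting values are our own illustration, not
DVC's implementation, and the value parsing is deliberately simplistic.</p>

```python
def set_param(params, assignment):
    """Apply a 'section.key=value' assignment to a nested dict in place."""
    dotted_key, raw_value = assignment.split("=", 1)
    *parents, leaf = dotted_key.split(".")

    # Walk down to the dict that holds the final key, creating levels as needed.
    node = params
    for key in parents:
        node = node.setdefault(key, {})

    # Simplistic typing: try int, then float, otherwise keep the string.
    for cast in (int, float):
        try:
            node[leaf] = cast(raw_value)
            break
        except ValueError:
            continue
    else:
        node[leaf] = raw_value
    return params


# Stand-in for the parsed contents of params.yaml (values are placeholders).
params = {"train": {"n_est": 100, "min_split": 64}}
set_param(params, "train.min_split=8")
print(params)  # → {'train': {'n_est': 100, 'min_split': 8}}
```

<p>Only the targeted value changes; everything else in the file stays as it
was, which is what lets each queued experiment differ by just the parameters
you set.</p>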
<p>Now we can make the script. You can add a new file to the <code>src</code> directory called
<code>grid_search.py</code>. Inside of the file, add the following code.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> itertools
<span class="token keyword">import</span> subprocess
<span class="token comment"># Automated grid search experiments</span>
n_est_values <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">250</span><span class="token punctuation">,</span> <span class="token number">300</span><span class="token punctuation">,</span> <span class="token number">350</span><span class="token punctuation">,</span> <span class="token number">400</span><span class="token punctuation">,</span> <span class="token number">450</span><span class="token punctuation">,</span> <span class="token number">500</span><span class="token punctuation">]</span>
min_split_values <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">8</span><span class="token punctuation">,</span> <span class="token number">16</span><span class="token punctuation">,</span> <span class="token number">32</span><span class="token punctuation">,</span> <span class="token number">64</span><span class="token punctuation">,</span> <span class="token number">128</span><span class="token punctuation">,</span> <span class="token number">256</span><span class="token punctuation">]</span>
<span class="token comment"># Iterate over all combinations of hyperparameter values.</span>
<span class="token keyword">for</span> n_est<span class="token punctuation">,</span> min_split <span class="token keyword">in</span> itertools<span class="token punctuation">.</span>product<span class="token punctuation">(</span>n_est_values<span class="token punctuation">,</span> min_split_values<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Execute "dvc exp run --queue --set-param train.n_est=<n_est> --set-param train.min_split=<min_split>".</span>
    subprocess<span class="token punctuation">.</span>run<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"dvc"</span><span class="token punctuation">,</span> <span class="token string">"exp"</span><span class="token punctuation">,</span> <span class="token string">"run"</span><span class="token punctuation">,</span> <span class="token string">"--queue"</span><span class="token punctuation">,</span>
                    <span class="token string">"--set-param"</span><span class="token punctuation">,</span> <span class="token string-interpolation"><span class="token string">f"train.n_est=</span><span class="token interpolation"><span class="token punctuation">{</span>n_est<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">,</span>
                    <span class="token string">"--set-param"</span><span class="token punctuation">,</span> <span class="token string-interpolation"><span class="token string">f"train.min_split=</span><span class="token interpolation"><span class="token punctuation">{</span>min_split<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">]</span><span class="token punctuation">)</span></code></pre></div>
<p>This is a simple grid search. We have two hyperparameters we want to tune:
<code>n_est</code> and <code>min_split</code>. Each one gets a short list of candidate values, and a
grid search exhaustively tries every combination of them. The loop walks through
those combinations and queues one experiment for each using <code>subprocess</code>.</p>
<p>You can run this script now and generate your queue with this command.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">python</span> src/grid_search.py</span></code></pre></div>
<p>You'll see some outputs in the terminal telling you that your experiments have
been queued. Then you can run them all with the following command.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--run-all</span></span></code></pre></div>
<p>This will run every experiment that has been queued. Once all of those have run,
take a look at your metrics for each experiment.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--include-params</span><span class="token operator">=</span>train.min_split,train.n_est <span class="token parameter variable">--no-timestamp</span></span></code></pre></div>
<p>Your table should look similar to this when you run the command above. We've
included the <code>--include-params</code> and <code>--no-timestamp</code> options to give us a table
that's easier to read.</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span>
</span> ──────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.67038<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.96693<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>64<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>100<span class="token hide">**</span></span>
<span class="token bold"><span class="token hide">**</span>try-large-dataset<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.67038<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.96693<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>64<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>100<span class="token hide">**</span></span>
├── 4899d41 [exp-ae5ed] 0.6888 0.97028 8 250
├── bcdd8ed [exp-56613] 0.68733 0.96773 16 250
├── 703f20b [exp-caa84] 0.68942 0.9698 32 250
├── 1a882e6 [exp-c208f] 0.681 0.96772 64 250
├── 3ac33fb [exp-4c53e] 0.67775 0.96664 128 250
├── ea90ee0 [exp-fdb47] 0.65382 0.96719 256 250
├── b8277b1 [exp-3fb5c] 0.68547 0.97011 8 300
├── 7be641e [exp-3bbbc] 0.6883 0.96724 16 300
├── 4202757 [exp-38ca4] 0.68808 0.96968 32 300
├── b71ee2f [exp-5384b] 0.68111 0.96848 64 300
├── 1bbb0f4 [exp-f5d54] 0.67707 0.96753 128 300
├── 71ba159 [exp-31749] 0.65282 0.96752 256 300
├── 836c1c5 [exp-2ce0a] 0.68758 0.96998 8 350
├── dac9e22 [exp-5c799] 0.68778 0.96779 16 350</span></code></pre></div>
<p>Now you can see how your precision changed with each hyperparameter value
update. This is a quick implementation of grid search in DVC. You could read the
hyperparameter values from a different file or data source or make this tuning
script as fancy as you like. The main thing you need is to execute the
<a href="https://dvc.org/doc/command-reference/exp/run#--queue"><code>dvc exp run --queue --set-param &lt;param&gt;</code></a> command for each new set of
values.</p>
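<p>As one way to read the values from a file, here is a minimal sketch. It assumes
a hypothetical JSON file (for example <code>grid.json</code>) that maps each parameter name
to a list of candidate values; the file name and the helper functions
<code>build_queue_commands</code> and <code>queue_from_file</code> are illustrative and not part of DVC.
Only the <code>dvc exp run --queue --set-param</code> command itself comes from the script
above.</p>

```python
import itertools
import json
import subprocess
from pathlib import Path


def build_queue_commands(grid):
    """Return one `dvc exp run --queue` command per combination in `grid`.

    `grid` maps a parameter name (e.g. "train.n_est") to a list of values.
    """
    names = sorted(grid)  # fixed order so the commands are deterministic
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        command = ["dvc", "exp", "run", "--queue"]
        for name, value in zip(names, values):
            command += ["--set-param", f"{name}={value}"]
        commands.append(command)
    return commands


def queue_from_file(path):
    """Load a grid from a JSON file (hypothetical layout) and queue every combination."""
    grid = json.loads(Path(path).read_text())
    for command in build_queue_commands(grid):
        # check=True stops the loop if DVC rejects a parameter.
        subprocess.run(command, check=True)
```

<p>Calling <code>queue_from_file("grid.json")</code> would then queue the experiments, and
building the commands in a separate function lets you inspect them before anything
is run.</p>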
<h3 id="random-search" style="position:relative;">Random search<a href="#random-search" aria-label="random search permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Another commonly used method for tuning hyperparameters is random search. It
samples random values for the hyperparameters and builds the model with them. It
usually takes less time than an exhaustive grid search, and it can find better
values when given the same time budget as a grid search.</p>
<p>We're going to add an example of random search in a new file called
<code>random_search.py</code>, similar to the file we created for grid search. This will add
queued experiments with the randomly selected hyperparameter values. Add the
following code to <code>random_search.py</code>.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> subprocess
<span class="token keyword">import</span> random
<span class="token comment"># Automated random search experiments</span>
num_exps <span class="token operator">=</span> <span class="token number">10</span>
random<span class="token punctuation">.</span>seed<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span>
<span class="token keyword">for</span> _ <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span>num_exps<span class="token punctuation">)</span><span class="token punctuation">:</span>
    params <span class="token operator">=</span> <span class="token punctuation">{</span>
        <span class="token string">"rand_n_est_value"</span><span class="token punctuation">:</span> random<span class="token punctuation">.</span>randint<span class="token punctuation">(</span><span class="token number">250</span><span class="token punctuation">,</span> <span class="token number">500</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
        <span class="token string">"rand_min_split_value"</span><span class="token punctuation">:</span> random<span class="token punctuation">.</span>choice<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">8</span><span class="token punctuation">,</span> <span class="token number">16</span><span class="token punctuation">,</span> <span class="token number">32</span><span class="token punctuation">,</span> <span class="token number">64</span><span class="token punctuation">,</span> <span class="token number">128</span><span class="token punctuation">,</span> <span class="token number">256</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
    <span class="token punctuation">}</span>
    subprocess<span class="token punctuation">.</span>run<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"dvc"</span><span class="token punctuation">,</span> <span class="token string">"exp"</span><span class="token punctuation">,</span> <span class="token string">"run"</span><span class="token punctuation">,</span> <span class="token string">"--queue"</span><span class="token punctuation">,</span>
                    <span class="token string">"--set-param"</span><span class="token punctuation">,</span> <span class="token string-interpolation"><span class="token string">f"train.n_est=</span><span class="token interpolation"><span class="token punctuation">{</span>params<span class="token punctuation">[</span><span class="token string">'rand_n_est_value'</span><span class="token punctuation">]</span><span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">,</span>
                    <span class="token string">"--set-param"</span><span class="token punctuation">,</span> <span class="token string-interpolation"><span class="token string">f"train.min_split=</span><span class="token interpolation"><span class="token punctuation">{</span>params<span class="token punctuation">[</span><span class="token string">'rand_min_split_value'</span><span class="token punctuation">]</span><span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">]</span><span class="token punctuation">)</span></code></pre></div>
<p>This search could be far more complex with Bayesian optimization to handle the
hyperparameter value selections, but we're keeping it super simple by choosing
random numbers to focus on reproducibility. This will generate ten experiments
with random values for each hyperparameter.</p>
<p>You can run these new experiments with <a href="https://dvc.org/doc/command-reference/exp/run#--run-all"><code>dvc exp run --run-all</code></a> and then take a
look at the results with
<a href="https://dvc.org/doc/command-reference/exp/show#--include-params"><code>dvc exp show --include-params=train.min_split,train.n_est --no-timestamp</code></a>. Your
table should look something like this.</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span>
</span> ──────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.67038<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.96693<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>64<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>100<span class="token hide">**</span></span>
<span class="token bold"><span class="token hide">**</span>try-large-dataset<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.67038<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.96693<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>64<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>100<span class="token hide">**</span></span>
├── fc28c0c [exp-45902] 0.68358 0.96956 64 466
├── f13ac72 [exp-b9dfa] 0.68275 0.96914 64 444
├── a8cbc8f [exp-b0aeb] 0.68989 0.97003 32 260
├── 4791c52 [exp-5f2b5] 0.67711 0.96809 128 497
├── c5398e0 [exp-86c74] 0.6811 0.96829 64 374
├── db16c91 [exp-db50f] 0.68986 0.97073 32 485
├── 2dd08fa [exp-fee4f] 0.68262 0.96941 64 497
├── 18d2ec5 [exp-d73c7] 0.67696 0.96726 128 341
├── 1710032 [exp-dd198] 0.68756 0.9687 16 478
├── 4f0b80a [exp-746c1] 0.68724 0.96811 16 379</span></code></pre></div>
<p>Comparing this table with the grid search table shows the difference between
randomly selected values and values taken from a fixed grid. Random search can
find a better value because it samples across the whole range, so it may land
near the optimum sooner than a grid search would.</p>
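<p>To make the comparison concrete, the sketch below copies the experiment rows
shown in the two tables above into plain Python lists and picks the row with the
highest <code>avg_prec</code> from each. The helper <code>best_by_avg_prec</code> is illustrative only,
and the grid search list covers just the rows displayed above, so treat the result
as a comparison of these samples rather than of the full searches.</p>

```python
# Rows copied from the tables above: (experiment, avg_prec, min_split, n_est).
GRID_ROWS = [
    ("exp-ae5ed", 0.6888, 8, 250), ("exp-56613", 0.68733, 16, 250),
    ("exp-caa84", 0.68942, 32, 250), ("exp-c208f", 0.681, 64, 250),
    ("exp-4c53e", 0.67775, 128, 250), ("exp-fdb47", 0.65382, 256, 250),
    ("exp-3fb5c", 0.68547, 8, 300), ("exp-3bbbc", 0.6883, 16, 300),
    ("exp-38ca4", 0.68808, 32, 300), ("exp-5384b", 0.68111, 64, 300),
    ("exp-f5d54", 0.67707, 128, 300), ("exp-31749", 0.65282, 256, 300),
    ("exp-2ce0a", 0.68758, 8, 350), ("exp-5c799", 0.68778, 16, 350),
]
RANDOM_ROWS = [
    ("exp-45902", 0.68358, 64, 466), ("exp-b9dfa", 0.68275, 64, 444),
    ("exp-b0aeb", 0.68989, 32, 260), ("exp-5f2b5", 0.67711, 128, 497),
    ("exp-86c74", 0.6811, 64, 374), ("exp-db50f", 0.68986, 32, 485),
    ("exp-fee4f", 0.68262, 64, 497), ("exp-d73c7", 0.67696, 128, 341),
    ("exp-dd198", 0.68756, 16, 478), ("exp-746c1", 0.68724, 16, 379),
]


def best_by_avg_prec(rows):
    """Return the row with the highest avg_prec (second field of each tuple)."""
    return max(rows, key=lambda row: row[1])


best_grid = best_by_avg_prec(GRID_ROWS)
best_random = best_by_avg_prec(RANDOM_ROWS)

for label, (name, avg_prec, min_split, n_est) in [("grid", best_grid), ("random", best_random)]:
    print(f"{label}: {name} avg_prec={avg_prec} min_split={min_split} n_est={n_est}")
```

<p>For these rows, the best random search experiment (<code>exp-b0aeb</code>, 0.68989) edges
out the best displayed grid search experiment (<code>exp-caa84</code>, 0.68942), and both use
<code>min_split=32</code>.</p>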
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>With the comparison between grid search and random search, you can see how
reproducibility can help you find the best model for your project. You'll be
able to see all of the hyperparameter changes and code changes that created each
model. This gives you the ability to fine-tune your model because you can go to
any experiment and resume training with different values, code, or data.</p>https://dvc.org/blog/july-21-dvc-heartbeathttps://dvc.org/blog/july-21-dvc-heartbeatFri, 16 Jul 2021 00:00:00 GMT<h1 id="welcome-to-summer" style="position:relative;">Welcome to Summer!<a href="#welcome-to-summer" aria-label="welcome to summer permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><img src="https://media.giphy.com/media/WuY9yfI89DbNu/giphy.gif" alt="It's summer!"></p>
<h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><span style="color:purple"><strong>A</strong></span>s usual, we have a ton of goodness from
the Community! Let's jump in!</p>
<h2 id="antoine-toubhans-post-combining-streamlit-and-dvc" style="position:relative;">Antoine Toubhans' Post Combining Streamlit and DVC!<a href="#antoine-toubhans-post-combining-streamlit-and-dvc" aria-label="antoine toubhans post combining streamlit and dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/antoine-toubhans-92262119/" target="_blank" rel="nofollow noopener noreferrer">Antoine Toubhans</a> of
<a href="https://www.sicara.fr/" target="_blank" rel="nofollow noopener noreferrer">Sicara</a> wrote a fantastic and detailed tutorial
entitled
<a href="https://www.sicara.ai/blog/dvc-streamlit-webui-ml" target="_blank" rel="nofollow noopener noreferrer"><strong>How to Build Customizable Web UI for Machine Learning with Streamlit and DVC</strong></a>
bringing together the best of DVC and integrating it with Streamlit to provide a
customizable UI. The tutorial <span style="color:purple"><strong>g</strong></span>oes through
the steps of setting up a pipeline, splitting a dataset, training and evaluating
a model, tracking changes to data and model, DVC
<span style="color:purple"><strong>m</strong></span>etrics and plots and then bridging the
gap in visualizations using <span style="color:purple"><strong>S</strong></span>treamlit. You
won't want to miss this one!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ca5a6eaba77575617f7935269fbefd1e/39600/streamlit2.png" alt="DVC and Streamlit" title="DVC and Streamlit" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DVC +
Streamlit = ♥️!
<a href="https://www.sicara.ai/blog/dvc-streamlit-webui-ml" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p>
<h2 id="dvc-and-cml-in-japanese" style="position:relative;">DVC and CML in Japanese!<a href="#dvc-and-cml-in-japanese" aria-label="dvc and cml in japanese permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>For our friends that speak Japanese,
<a href="https://www.slideshare.net/yusukeshibui/testing-machine-learningdevelopment" target="_blank" rel="nofollow noopener noreferrer">these slides</a>
created by
<a href="https://www.slideshare.net/yusukeshibui?utm_campaign=profiletracking&utm_medium=sssite&utm_source=ssslideview" target="_blank" rel="nofollow noopener noreferrer">Yusuke Shibui</a>
walk you through a machine-learning-to-production project using
D<span style="color:purple"><strong>V</strong></span>C and
C<span style="color:purple"><strong>M</strong></span>L. We love seeing our tools being used
all around the world! 🌏</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/32aab98c003f28761779d2fb79c247af/39600/in-japanese.png" alt="DVC and CML in Japanese" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DVC
and CML in Japanese!
<a href="https://www.slideshare.net/yusukeshibui/testing-machine-learningdevelopment" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p>
<h2 id="miguel-méndez-dvc-tutorial" style="position:relative;">Miguel Méndez' DVC Tutorial<a href="#miguel-m%C3%A9ndez-dvc-tutorial" aria-label="miguel méndez dvc tutorial permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/miguel-mendez/" target="_blank" rel="nofollow noopener noreferrer">Miguel Méndez</a> and his team at
<a href="https://www.gradiant.org/en/" target="_blank" rel="nofollow noopener noreferrer">Gradiant</a>
<span style="color:purple"><strong>s</strong></span>truggled with reproducibility before
using DVC for versioning their image dataset and annotations. The dataset and
annotations are held in a shared storage space and used by the whole team. DVC
enables the team to track changes and know what versions of the dataset produce
the best results. His tutorial walks you through the steps to set it up!</p>
<p>
</p><section class="elp-content-holder">
<a href="https://mmeendez8.github.io/2021/07/01/dvc-tutorial.html" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Version Control Your Dataset with DVC</h4>
<div class="elp-description">Miguel Méndez' tutorial on using DVC for versioning datasets and providing reproducibility</div>
<div class="elp-link">https://github.io</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-07-16/git-dvc-77f8c394ced19aec2e78228f20003fd6.png" alt="Version Control Your Dataset with DVC">
</div>
</a>
</section>
<p></p>
<h2 id="jobs-requiring-dvc" style="position:relative;">Jobs requiring DVC!<a href="#jobs-requiring-dvc" aria-label="jobs requiring dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We have been seeing an uptick in the number of jobs requiring knowledge of DVC.
It's exciting to see that our tools are
helpin<span style="color:purple"><strong>g</strong></span> these companies in their MLOps
workflows! 🎉</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/58da1cf2344bc6fbbc522a2e67842e03/39600/job-descriptions.png" alt="job descriptions" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h1 id="learning-opportunities" style="position:relative;">Learning Opportunities<a href="#learning-opportunities" aria-label="learning opportunities permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>With all those DVC job opportunities out there, you
<span style="color:purple"><strong>b</strong></span>etter get on it! 😉</p>
<h2 id="a-new-udacity-course-incorporating-dvc" style="position:relative;">A New Udacity Course Incorporating DVC!<a href="#a-new-udacity-course-incorporating-dvc" aria-label="a new udacity course incorporating dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Just this month a new
<a href="https://www.udacity.com/course/machine-learning-dev-ops-engineer-nanodegree--nd0821" target="_blank" rel="nofollow noopener noreferrer">Udacity</a>
nanodegree program came out entitled
<a href="https://www.udacity.com/course/machine-learning-dev-ops-engineer-nanodegree--nd0821" target="_blank" rel="nofollow noopener noreferrer"><strong>Machine Learning DevOps Engineer</strong></a>
that teaches DVC as part of the program. This course includes sections on:</p>
<ul>
<li>Clean Code Principles</li>
<li>Building a Reproducible <span style="color:purple"><strong>M</strong></span>odel Workflow</li>
<li>Deploying a Scalable ML Pipeline in Production</li>
<li>Automated Model Scoring and Monitoring</li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://www.udacity.com/course/machine-learning-dev-ops-engineer-nanodegree--nd0821" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Machine Learning DevOps Engineer</h4>
<div class="elp-description">A new nanodegree program offered by Udacity teaching DVC as part of the curriculum</div>
<div class="elp-link">https://udacity.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-07-16/udacity-26a2fd44369db81a4577a44b318f8559.png" alt="Machine Learning DevOps Engineer">
</div>
</a>
</section>
<p></p>
<h2 id="dvc-leaspan-stylecolorpurplerspann" style="position:relative;">DVC Lea<span style="color:purple"><strong>r</strong></span>n<a href="#dvc-leaspan-stylecolorpurplerspann" aria-label="dvc leaspan stylecolorpurplerspann permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This week we kicked off our new DVC Learn Meetup series with
<a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a>. This set of three
short, half-hour classes is designed to get you up and running with DVC. If you
are just getting started with <span style="color:purple"><strong>D</strong></span>VC or
kicking the tires, this Meetup series is for you! Our next class on August 4th
will get you started with experiments.</p>
<p>If you are interested in weighing in on what kinds of educational content you
would like to see from us, we'd be grateful if you'd fill out
<a href="https://docs.google.com/forms/d/e/1FAIpQLSdmwjs0ZkxDdODfZTvSwP2bVW4JAVVdxiYhQPyW5dSbsZC8qg/viewform?pli=1" target="_blank" rel="nofollow noopener noreferrer"><strong>this survey</strong></a>
to help us plan! 🙏🏼</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/279447414/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">DVC Learn - Getting Started: Experiments</h4>
          <div class="elp-description">The next DVC Learn Meetup taught by Milecia McGregor designed to get you started with DVC Experiments</div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-07-16/dvc_learn-415db7d91a0670061d51698d6880fc57.png" alt="DVC Learn - Getting Started: Experiments">
</div>
</a>
</section>
<p></p>
<h2 id="data-science-journal-article-on-reproducibility-practices-in-research" style="position:relative;">Data Science Journal Article on Reproducibility Practices in Research<a href="#data-science-journal-article-on-reproducibility-practices-in-research" aria-label="data science journal article on reproducibility practices in research permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>New research presented in the
<a href="https://datascience.codata.org/" target="_blank" rel="nofollow noopener noreferrer">Data Science Journal</a> aims to provide best
practices for reproducibility in research datasets. Reproducing a result requires
pinpointing the exact version of the dataset that grounds the research. In this work,
the authors reviewed 39 use cases from 33 organizations to arrive at six
principles for versioning datasets. These include <strong>Revision</strong>, <strong>Release</strong>,
<strong>Granularity</strong>, <strong>Manifestation</strong>,
<span style="color:purple"><strong>P</strong></span><strong>rovenance</strong> and <strong>Citation</strong>. See the
full work below. 👇🏼</p>
<p>
</p><section class="elp-content-holder">
<a href="https://datascience.codata.org/articles/10.5334/dsj-2021-012/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
          <h4 class="elp-title">Versioning Data is About More Than Revisions: A Conceptual Framework and Proposed Principles</h4>
          <div class="elp-description">Authors analyze 39 use cases in 33 organizations to arrive at proposed principles when versioning data.</div>
<div class="elp-link">https://datascience.codata.org</div>
</div>
<div class="elp-image-holder">
          <img src="https://dvc.org/2021-07-16/dsj-7d89083918a0e490fd9e1511e6ad40bc.png" alt="Versioning Data is About More Than Revisions: A Conceptual Framework and Proposed Principles">
</div>
</a>
</section>
<p></p>
<h2 id="june-office-hours-meetup" style="position:relative;">June Office Hours Meetup<a href="#june-office-hours-meetup" aria-label="june office hours meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The June Office Hours Meetup was 🔥! Amazing discussion on experiments ignited
by <a href="https://www.linkedin.com/in/sami-jawhar-a58b9849/" target="_blank" rel="nofollow noopener noreferrer">Sami Jawhar</a> of
<a href="https://www.kernel.com/" target="_blank" rel="nofollow noopener noreferrer">Kernel</a> around experiment use cases and workflows.<br>
You can
<a href="https://github.com/sjawhar/dvc-cloud-runner" target="_blank" rel="nofollow noopener noreferrer">find the repo for his presentation here</a>
and watch all the great DVC discussion below.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/DxZdWq3Weng?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><span style="color:purple"><strong>S</strong></span>ummer and vaccinations mean travel! ☀️💉
And that travel has enabled some of our team members to get together! Pictured
below are Dmitry Petrov, Alexander Guschin, Max Shmakov, Mikhail Rozhkov, Sergey
Kryukov, Mikhail Sveshnikov, and Guro Bokum… But not necessarily in that
order.</p>
<p>The first person to guess the correct order of our teammates starting from the
upper right of the picture moving clockwise, <strong>and</strong> post in the corresponding
<a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> Heartbeat post, will win some DVC SWAG!
Hint: If you've been wondering why there are random purple letters in this blog
post, they're a clue to this cipher. 🧐</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 661px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7c5e525a8aaaf6a2e6a3cb01591fc88a/d5cf8/team.png" alt="team" title="team" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Team Meetup in Moscow! (hand
signals obscured for our UK friends, because we care! 🤗)</em></p>
<h2 id="new-team-member" style="position:relative;">New Team Member<a href="#new-team-member" aria-label="new team member permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/david-de-la-iglesia-castro-b4b67b20a/" target="_blank" rel="nofollow noopener noreferrer">David de la Iglesia Castro</a>
is the third teammate joining us from Spain! 🇪🇸 And also the third David! He
hails from Galicia and has been an active member of our Community for over two
years. We are so excited to have him join the team as a software engineer where
he will wor<span style="color:purple"><strong>k</strong></span> to improve DVC Live. When
he's not contributing to DVC, David likes to go climbing, surfing or just hiking
whenever he can! Welcome David!</p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>And yes indeed, we are still hiring!
<a href="https://www.notion.so/iterative/iterative-ai-is-hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a>
to find details of all the positions including:</p>
<ul>
<li>Senior Front-End Engineer (TypeScript, Node, React)</li>
<li>Senior Software Engineer (ML, Dev Tools, Python)</li>
<li>Senior Software Engineer (ML, Data Infra, GoLang)</li>
<li>Machine Learning Engineer/Field Data Scientist</li>
<li>Developer Advocate (ML)</li>
<li>Director/VP of Engineering (ML, DevTools)</li>
<li>Director/VP of Product (ML, Data Infra, SaaS)</li>
<li>Director/VP of Operations/Chief of Staff</li>
</ul>
<p>Please pass this info on to anyone you know who may fit the bill. We look
forward to new team members! 🎉</p>
<h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Don't miss our
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/279024694/" target="_blank" rel="nofollow noopener noreferrer">Meetup</a>
July 28th at 2:00 pm UTC (7:00 am PDT), where
<a href="https://www.linkedin.com/in/jcpsantiago/" target="_blank" rel="nofollow noopener noreferrer">João Santiago</a> of
<a href="https://www.billie.io/" target="_blank" rel="nofollow noopener noreferrer">Billie</a> will present "DVThis", a set of utility
functions for DVC pipelines using R scripts. Additionally, the project aims to
document the usual workflows of a DVC pipeline using these scripts and create
templates for the use of DVC and R together.</p>
<p>Following Santiago, team member
<a href="https://www.linkedin.com/in/tapa-dipti-sitaula/" target="_blank" rel="nofollow noopener noreferrer">Tapa Dipti Sitaula</a> will give
a demo of DVC Studio! Bring your questions; we look forward to seeing you!</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/279024694/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">DVThis</h4>
<div class="elp-description">At the July DVC Office Hours, João Santiago of Billie presents DVThis and shows us how to use R with DVC, and Tapa Dipti Sitaula shares a demo of DVC Studio.</div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-07-16/office-hours-meetup-4d64171025fb882a3b68512f807f2d53.png" alt="DVThis">
</div>
</a>
</section>
<p></p>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Fantastically detailed tutorial from <a href="https://twitter.com/AntoineToubhans">@AntoineToubhans</a> on how to build a customizable web UI for <a href="https://twitter.com/hashtag/MachineLearning?src=hash&ref_src=twsrc%5Etfw">#MachineLearning</a> with <a href="https://twitter.com/streamlit">@Streamlit</a> and <a href="https://twitter.com/DVCorg">@DVCorg</a>! 🐍🎈<a href="https://t.co/zrZCueWk0n">https://t.co/zrZCueWk0n</a></p>— Charly Wargnier (@DataChaz) <a href="https://twitter.com/DataChaz/status/1410319379837894656">June 30, 2021</a></blockquote>
<hr>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/june-21-community-gemshttps://dvc.org/blog/june-21-community-gemsWed, 30 Jun 2021 00:00:00 GMT<h3 id="q-is-it-possible-to-plot-multiple-experiments-together" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/834387923482181653" target="_blank" rel="nofollow noopener noreferrer">Q: Is it possible to plot multiple experiments together?</a><a href="#q-is-it-possible-to-plot-multiple-experiments-together" aria-label="q is it possible to plot multiple experiments together permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You can use experiment names in the <a href="https://dvc.org/doc/command-reference/plots"><code>dvc plots</code></a> commands. You need to use the
<code>diff</code> command to compare multiple plots. Try
<a href="https://dvc.org/doc/command-reference/plots/diff#-"><code>dvc plots diff exp-6ef18 exp-b17b4 exp-26e88</code></a>.</p>
<p>Thanks to @PythonF from Discord for asking this question that led to this Gem!
💎</p>
<h3 id="q-where-is-the-list-of-experiment-being-pushed-in-git-when-i-run-dvc-exp-push" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/837773937390649364" target="_blank" rel="nofollow noopener noreferrer">Q: Where is the list of experiment being pushed in Git when I run <code>dvc exp push</code>?</a><a href="#q-where-is-the-list-of-experiment-being-pushed-in-git-when-i-run-dvc-exp-push" aria-label="q where is the list of experiment being pushed in git when i run dvc exp push permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><code>dvc exp push</code> uses custom Git refs internally, similar to the way GitHub handles PRs. Each
pushed experiment is a custom DVC Git ref pointing to a Git commit. Here's an example.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> show-ref exp-26220
</span>c42f48168830148b946f6a75d1bdbb25cda46f35 refs/exps/f1/37703af59ba1b80e77505a762335805d05d212/exp-26220</code></pre></div>
<p>If you want to see your local experiments (that have not been pushed), you can
run <a href="https://dvc.org/doc/command-reference/exp/list#--all"><code>dvc exp list --all</code></a>.</p>
<p>You can read more about how we handle our custom Git refs in
<a href="https://dvc.org/blog/experiment-refs" target="_blank" rel="nofollow noopener noreferrer">this blog post</a>.</p>
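<p>To make the shape of such a ref concrete, here is a minimal Python sketch that
splits the <code>git show-ref</code> line from the example above into its parts. The
helper name <code>parse_exp_ref</code> is hypothetical, and the reading of the two path
components after <code>refs/exps/</code> as a baseline commit SHA split after its first
two characters is an assumption based on the layout of the example, not a
documented guarantee.</p>

```python
def parse_exp_ref(show_ref_line):
    """Split one line of `git show-ref` output for a DVC experiment ref."""
    commit, ref = show_ref_line.split()
    parts = ref.split("/")
    # Expected shape (assumed): refs/exps/XX/REST_OF_SHA/EXPERIMENT_NAME
    if len(parts) != 5 or parts[:2] != ["refs", "exps"]:
        raise ValueError("not an experiment ref: " + ref)
    return {
        "commit": commit,                 # commit the ref points to
        "baseline": parts[2] + parts[3],  # assumed: baseline SHA, split 2 + 38
        "name": parts[4],                 # experiment name
    }


line = (
    "c42f48168830148b946f6a75d1bdbb25cda46f35 "
    "refs/exps/f1/37703af59ba1b80e77505a762335805d05d212/exp-26220"
)
info = parse_exp_ref(line)
print(info["name"])  # exp-26220
```

<p>Because experiments live under their own ref namespace rather than under
branches or tags, they stay out of your regular Git history until you decide to
promote one.</p>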
<p>Thanks to @Chandana for asking this question about experiments!</p>
<h3 id="q-is-there-a-way-to-list-all-the-experiments-i-have-on-my-dvc-remote-that-have-not-been-committed-to-git" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/836705209039978538" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to list all the experiments I have on my DVC remote that have not been committed to Git?</a><a href="#q-is-there-a-way-to-list-all-the-experiments-i-have-on-my-dvc-remote-that-have-not-been-committed-to-git" aria-label="q is there a way to list all the experiments i have on my dvc remote that have not been committed to git permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes! You can quickly look at all of the experiments in any repo with:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp list</span> <span class="token parameter variable">--all</span> <span class="token operator"><</span>git repo URL<span class="token operator">></span></span></code></pre></div>
<p>or</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp list</span> <span class="token parameter variable">--all</span> <span class="token operator"><</span>git remote<span class="token operator">></span></span></code></pre></div>
<p>Thanks again @Chandana for this gem!</p>
<h3 id="q-is-cml-compatible-with-azure-devops" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/841664412221177926" target="_blank" rel="nofollow noopener noreferrer">Q: Is CML compatible with Azure DevOps?</a><a href="#q-is-cml-compatible-with-azure-devops" aria-label="q is cml compatible with azure devops permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Another great question from @Chandana!</p>
<p>Right now, we support GitHub and GitLab.</p>
<p>Azure DevOps and GCP (Google Cloud Platform) support are on the roadmap. Stay
tuned for more updates!</p>
<p>You can stay up to date with our Azure DevOps progress on
<a href="https://github.com/iterative/cml/issues/142" target="_blank" rel="nofollow noopener noreferrer">this issue on GitHub</a>. You can
also follow along with GCP updates with
<a href="https://github.com/iterative/terraform-provider-iterative/issues/64" target="_blank" rel="nofollow noopener noreferrer">this issue</a>.</p>
<h3 id="q-i-pushed-a-lot-of-files-using-dvc-push-to-my-dvc-remote-but-there-are-a-few-that-couldnt-be-pushed-at-the-time-if-i-run-dvc-push-again-will-it-just-upload-the-missing-files" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/842662337159757854" target="_blank" rel="nofollow noopener noreferrer">Q: I pushed a lot of files using <code>dvc push</code> to my DVC remote, but there are a few that couldn't be pushed at the time. If I run <code>dvc push</code> again, will it just upload the missing files?</a><a href="#q-i-pushed-a-lot-of-files-using-dvc-push-to-my-dvc-remote-but-there-are-a-few-that-couldnt-be-pushed-at-the-time-if-i-run-dvc-push-again-will-it-just-upload-the-missing-files" aria-label="q i pushed a lot of files using dvc push to my dvc remote but there are a few that couldnt be pushed at the time if i run dvc push again will it just upload the missing files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for the question @petek!</p>
<p>Yes! You can just re-run <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> and it will only upload the missing files.</p>
<p>It might be a little slower than you would expect because DVC has to do some
checks to make sure that the other files were uploaded successfully before, but
as far as the actual data transfer goes, only the missing files will be
uploaded.</p>
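<p>Conceptually, the re-run boils down to a set difference between what you have
locally and what the remote already holds. The Python sketch below illustrates
that idea only; it is not DVC's actual implementation, and the function name and
the short hash strings are made up for the example.</p>

```python
def files_to_upload(local_hashes, remote_hashes):
    """Return the content hashes present locally but absent from the remote."""
    return sorted(set(local_hashes) - set(remote_hashes))


local = ["a1", "b2", "c3", "d4"]  # everything tracked locally
remote = ["a1", "c3"]             # what an earlier, partially failed push uploaded
pending = files_to_upload(local, remote)
print(pending)  # ['b2', 'd4']
```

<p>The existence check against the remote is the part that takes time; the
transfer itself only covers the entries left in <code>pending</code>.</p>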
<h3 id="q-lets-say-i-have-a-dvc-pipeline-with-two-stages-can-i-only-pull-the-second-one-and-keep-the-first-one-for-other-uses-can-i-pull-some-specific-output-from-the-pipeline" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/841688323663855616" target="_blank" rel="nofollow noopener noreferrer">Q: Let's say I have a DVC pipeline with two stages, can I only pull the second one and keep the first one for other uses? Can I pull some specific output from the pipeline?</a><a href="#q-lets-say-i-have-a-dvc-pipeline-with-two-stages-can-i-only-pull-the-second-one-and-keep-the-first-one-for-other-uses-can-i-pull-some-specific-output-from-the-pipeline" aria-label="q lets say i have a dvc pipeline with two stages can i only pull the second one and keep the first one for other uses can i pull some specific output from the pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You can pull specific outputs from a pipeline with
<a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull path/to/specific/output</code></a>. This is similar to how you can use <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>
to work with specific files and directories.</p>
<p>Thanks for such a great question @LucZ!</p>
<h3 id="q-how-does-dvc-handle-incremental-changes-in-the-data-and-how-does-it-work-with-non-dvc-based-pipeline-features" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/846364469524430848" target="_blank" rel="nofollow noopener noreferrer">Q: How does DVC handle incremental changes in the data and how does it work with non-DVC based pipeline features?</a><a href="#q-how-does-dvc-handle-incremental-changes-in-the-data-and-how-does-it-work-with-non-dvc-based-pipeline-features" aria-label="q how does dvc handle incremental changes in the data and how does it work with non dvc based pipeline features permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>These are good questions for common problems in MLOps from @Phoenix!</p>
<p>To answer the first part, say you are getting new data every week. When you use
DVC, you don't have to worry about getting duplicate data.</p>
<p>DVC supports file-level deduplication right now, so if your data is organized
as a directory of files, each unique file will only be stored once.
Chunk-level deduplication is on our todo list. You can see how it's going in
<a href="https://github.com/iterative/dvc/issues/829" target="_blank" rel="nofollow noopener noreferrer">this issue we have on GitHub</a>.</p>
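<p>File-level deduplication can be pictured as a content-addressed store: files
are keyed by a hash of their contents, so two paths with identical bytes share
one stored copy. The toy Python class below is a sketch of that idea under
simplifying assumptions (an in-memory dictionary and SHA-256 as the hash); the
class and file names are hypothetical and this is not DVC's cache code.</p>

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: one entry per unique file content."""

    def __init__(self):
        self.blobs = {}  # content hash to bytes
        self.index = {}  # file path to content hash

    def add(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self.blobs.setdefault(digest, data)
        self.index[path] = digest
        return digest


store = ContentStore()
store.add("week1/cats.csv", b"id,label\n1,cat\n")
store.add("week2/cats.csv", b"id,label\n1,cat\n")  # same content as week 1
store.add("week2/dogs.csv", b"id,label\n2,dog\n")

print(len(store.index))  # 3
print(len(store.blobs))  # 2
```

<p>Three paths are tracked, but only two blobs are stored. Note that changing a
single byte in a large file produces a new hash and a full new copy, which is
the gap chunk-level deduplication would close.</p>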
<p>For the second part of the question, you can use DVC for data management and
keep your own pipelines. Just treat DVC as Git for data, then be sure to run
<a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>, <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>, and <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> and you should be set. Hooks, like
<code>pre-commit</code> or <code>post-pipeline-run</code>, are a good way to automate those calls.</p>
<h3 id="q-is-there-a-way-to-tell-dvc-to-use-a-different-profile-instead-of-the-default-profile-when-interacting-with-s3" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/846857498094469120" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to tell DVC to use a different profile instead of the default profile when interacting with S3?</a><a href="#q-is-there-a-way-to-tell-dvc-to-use-a-different-profile-instead-of-the-default-profile-when-interacting-with-s3" aria-label="q is there a way to tell dvc to use a different profile instead of the default profile when interacting with s3 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>When your remote is not on your default AWS profile and you
access it via the <code>awscli</code> using something like
<code>aws s3 --profile=second_profile ls</code>, you'll need to update your remote config
in DVC.</p>
<p>You can run a command like:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> myremote profile myprofile</span></code></pre></div>
<p>Check out the docs on <a href="https://dvc.org/doc/command-reference/remote/modify"><code>dvc remote modify</code></a> for all the remote config options.</p>
<p>Great question @Avi!</p>
<hr>
<p><img src="https://media.giphy.com/media/l0IycQmt79g9XzOWQ/giphy.gif" alt="Shut It Down GIF by Matt Cutshall"></p>
<p>At our July Office Hours Meetup we will be demo-ing pipelines as well as CML.
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/279024694/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a>
to stay up to date with specifics as we get closer to the event!</p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and
CML questions answered!</p>https://dvc.org/blog/june-21-dvc-heartbeathttps://dvc.org/blog/june-21-dvc-heartbeatFri, 18 Jun 2021 00:00:00 GMT<h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>This month I'm going to take you on a thought-provoking journey through some of
the content from our community.</p>
<p><img src="https://media.giphy.com/media/Uni2jYCihB3fG/giphy.gif" alt="So many choices..."></p>
<h2 id="lj-mirandas-triad-of-order" style="position:relative;">LJ Miranda's Triad of order<a href="#lj-mirandas-triad-of-order" aria-label="lj mirandas triad of order permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The MLOps tool landscape can be confusing to say the least.<br>
<a href="https://twitter.com/ljvmiranda921" target="_blank" rel="nofollow noopener noreferrer">LJ Miranda</a>, in a well-written
<a href="https://ljvmiranda921.github.io/notebook/2021/05/10/navigating-the-mlops-landscape/" target="_blank" rel="nofollow noopener noreferrer">three-part series</a>,
lays out a framework for making sense of this space. The list of tools is not
exhaustive, but the framework and thought process for evaluating the tools is
intriguing. Additionally, he encourages thinking about the skillset of the
members of your team within this framework to help you make decisions on the
right tools. It's not just about the tools, it's about the people!</p>
<p>As you can see, DVC makes it into the "Trial" loop, but we think we will be
making it into the adoption region in relatively short order. 😉🚀</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 675px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e9aa3f0108be0f8003703fa2dce42573/39600/LJMiranda.png" alt="LJMiranda" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Making sense of the MLOps
Landscape</em></p>
<h2 id="found-in-the-mlops-community" style="position:relative;">Found in the MLOps Community<a href="#found-in-the-mlops-community" aria-label="found in the mlops community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>You can find more comments from LJ Miranda and others in response to a
<a href="https://mlops-community.slack.com/?redir=%2Farchives%2FC015J2Y9RLM%2Fp1622714574054300" target="_blank" rel="nofollow noopener noreferrer">great question</a>
from André Godinho in the <a href="https://mlops.community/" target="_blank" rel="nofollow noopener noreferrer">MLOps Community</a> Slack (see
below). If you're into MLOps and you're NOT a part of this Community, you should
be. You can join their Slack
<a href="https://mlops-community.slack.com/join/shared_invite/zt-o96abp9z-sRYKWb96wGK9vdhUvbSrsQ#/shared-invite/email" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<blockquote>
<p>I have recently came across with DVC by listening to MLOps Coffee Sessions #6
with David Aponte and Elle O'Brien (Such an interesting talk! 💯). This tool
integrates smoothly with Git, tracks models & datasets, and also has an online
UI DVC Studio 🚀. Is there any use case of MLflow that DVC can't handle? I
find DVC to give more rise to creativity as it integrates really well with
Git. - André Godinho</p>
</blockquote>
<h2 id="neda-sultovas-tutorial-and-tool-rubric" style="position:relative;">Neda Sultova's Tutorial and Tool Rubric<a href="#neda-sultovas-tutorial-and-tool-rubric" aria-label="neda sultovas tutorial and tool rubric permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Drilling down to the next level, I give you
<a href="https://medium.com/geekculture/exploring-dvc-for-machine-learning-pipelines-in-research-part-1-3ebc2ca35a18" target="_blank" rel="nofollow noopener noreferrer">this tutorial</a>
by <a href="https://www.linkedin.com/in/neda-sultova-597a811a8/" target="_blank" rel="nofollow noopener noreferrer">Neda Sultova</a>. Not only
is it a great tutorial of DVC in and of itself, but Neda also defines a clear
framework for the decision making process at
<a href="https://www.helmholtz.ai/" target="_blank" rel="nofollow noopener noreferrer">Helmholtz AI</a>. Among the needs are reproducibility,
workflow integration, exchangeable backend, framework agnostic, open source, and
the ability of the solution to be tweaked to the team's needs.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/geekculture/exploring-dvc-for-machine-learning-pipelines-in-research-part-1-3ebc2ca35a18" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Exploring DVC for Machine Learning Pipelines in Research (Part 1)</h4>
<div class="elp-description">The first of a multi-part series on the search and decision making process for MLOps tools at Helmholtz AI.</div>
<div class="elp-link">https://medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-06-18/neda-sultova-eaa1b8385248cb7e979bc5bc7c3a3461.png" alt="Exploring DVC for Machine Learning Pipelines in Research (Part 1)">
</div>
</a>
</section>
<p></p>
<h2 id="our-philosophy" style="position:relative;">Our Philosophy<a href="#our-philosophy" aria-label="our philosophy permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>And at last I bring you to
<a href="https://thenewstack.io/the-road-to-ai-hell-starts-with-good-mlops-intentions/" target="_blank" rel="nofollow noopener noreferrer">"The Road to AI Hell Starts with Good MLOps Intentions" </a>
by our CEO <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> which explains our
philosophy in the MLOps space. You will learn about the experiences that led to
developing our tools, what we think is the right way to solve MLOps challenges,
and how we do it.</p>
<blockquote>
<p>Teams made up of data scientists and developers should be able to define their
own workflow based on their business requirements and team preferences, just
like they do today when constructing any other software artifact. Rather than
a platform forcing teams to embrace a highly opinionated workflow, they can
employ flexible tools such as Git, GitHub, and their existing CI tools as they
see fit. - Dmitry Petrov</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://thenewstack.io/the-road-to-ai-hell-starts-with-good-mlops-intentions/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">The Road to AI Hell Starts with Good MLOps Intentions</h4>
<div class="elp-description">Dmitry Petrov explains the journey and philosophy at the heart of Iterative.ai's MLOps tools.</div>
<div class="elp-link">https://thenewstack.io</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-06-18/ai-hell-12bc4bdc3b703583bb29a50479563837.png" alt="The Road to AI Hell Starts with Good MLOps Intentions">
</div>
</a>
</section>
<p></p>
<h1 id="big-news-" style="position:relative;">Big News! 🚀🚀🚀<a href="#big-news-" aria-label="big news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>In case you missed it, on June 3rd we introduced our latest tool: DVC Studio, a web
application that displays your team's work with DVC and CML in a GUI. We know this has
been on our Community's wishlist and now it's here! You can check out all its
features and <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">give it a try here</a>. Or check out
the introduction video below.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/hKf4twg832g?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h1 id="learning-opportunities" style="position:relative;">Learning Opportunities<a href="#learning-opportunities" aria-label="learning opportunities permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="r-for-dvc" style="position:relative;">R for DVC!<a href="#r-for-dvc" aria-label="r for dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Are you or someone on your team an R user?
<a href="https://twitter.com/jcpsantiago" target="_blank" rel="nofollow noopener noreferrer">João Santiago</a>, who has contributed to DVC,
recently came up with "dvcru" to provide utility functions for DVC pipelines
using R scripts. Additionally, the project aims to show the typical workflows
these functions enable and to provide project templates. Check out all the R goodness in
<a href="https://github.com/jcpsantiago/dvcru" target="_blank" rel="nofollow noopener noreferrer">this GitHub repository</a>.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://github.com/jcpsantiago/dvcru" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">dvcru</h4>
<div class="elp-description">João Santiago's repository for dvcru, providing utility functions for DVC Pipelines using R scripts.</div>
<div class="elp-link">https://github.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-06-18/r-76a084e1e5c947fe6dfecf2312218942.png" alt="dvcru">
</div>
</a>
</section>
<p></p>
<h2 id="milecia-mcgregor-at-pydata-socal" style="position:relative;">Milecia McGregor at PyData SoCal<a href="#milecia-mcgregor-at-pydata-socal" aria-label="milecia mcgregor at pydata socal permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Next up we have <a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer">Milecia McGregor</a> presenting
and live coding at <a href="https://www.meetup.com/PyData-SoCal/" target="_blank" rel="nofollow noopener noreferrer">PyData SoCal</a>
organized by <a href="https://twitter.com/MaverickPramit" target="_blank" rel="nofollow noopener noreferrer">Pramit Choudhary</a>. Check out
her talk on "Reproducible ML Experiments (with Git and DVC)" and all the great
questions that ensued.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/h0vDuw3s2fE?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="dmitry-petrov-at-mlops-world" style="position:relative;">Dmitry Petrov at MLOps World<a href="#dmitry-petrov-at-mlops-world" aria-label="dmitry petrov at mlops world permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Finally we have <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov's</a> talk at the
<a href="https://mlopsworld.com/" target="_blank" rel="nofollow noopener noreferrer">MLOps World Conference</a> about machine learning in
production entitled "Data Versioning and ML Experiments on Top of Git."</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/Lc0hsT-i7qo?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>We're still growing! Meet this month's new team members.</p>
<h2 id="new-team-members" style="position:relative;">New Team Members<a href="#new-team-members" aria-label="new team members permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/jelle-bouwman/" target="_blank" rel="nofollow noopener noreferrer">Jelle Bouwman</a> joins us from
Utrecht, Netherlands as a software engineer. He's worked as a consultant and at
an agency. He's most proud of
<a href="https://rotterdam.navigate-connections.com/voyages" target="_blank" rel="nofollow noopener noreferrer">the work he did with his team at the Port of Rotterdam</a>.
In his free time, Jelle loves reading fiction and books on human
psychology/productivity, hiking and making music with others. He has already
shared with the team a
<a href="https://open.spotify.com/album/1LqgEMQNmL2yvjsGpihGee?si=7tCaG8-QQ92xvrlVvaUR7A" target="_blank" rel="nofollow noopener noreferrer">great playlist</a>
to listen to while trying to focus! Welcome Jelle! 🎼</p>
<p>Next we welcome <a href="https://www.linkedin.com/in/1aguschin/" target="_blank" rel="nofollow noopener noreferrer">Alexander Guschin</a>.
Alexander joins us from Russia where he has been a Data Scientist/ML Engineer
for the last five years. He's also participated in many Kaggle competitions and
was ranked 5th in general competitions at some point! This led him to create a
Coursera course on
<a href="https://www.coursera.org/learn/competitive-data-science" target="_blank" rel="nofollow noopener noreferrer">how to win data science competitions</a>
about the tips and tricks needed to win one. Teaching is his passion and you
will probably see him producing some content in the near future. 🧑🏽‍💻</p>
<p><a href="https://www.linkedin.com/in/mike0sv/" target="_blank" rel="nofollow noopener noreferrer">Mikhail Sveshnikov</a> also joins us from
Russia where he formerly worked as a Data Engineer Team Lead for Rubbles. He
created <a href="https://github.com/zyfra/ebonite" target="_blank" rel="nofollow noopener noreferrer">ebonite</a>, an ML deployment tool and
teaches Python and Big Data at HSE University. Finally, he is one of the admins
of the <a href="https://ods.ai/" target="_blank" rel="nofollow noopener noreferrer">ods.ai</a> community, which creates global projects to unite
the community, promote Data Science, and help people develop their skills. In
his spare time he likes to play guitar, badminton, ski, and mix cocktails. 🍸
Cheers Mikhail!</p>
<p><a href="https://www.linkedin.com/in/jervishui/" target="_blank" rel="nofollow noopener noreferrer">Jervis Hui</a> is joining the go-to-market
team at Iterative and is from NYC. He's worked in product marketing at various
Silicon Valley tech companies over the years and is excited to bring his
experience to the open source world of Iterative. He's passionate about D&I in
hiring and looks forward to learning from everyone! We're excited to have Jervis
on board! 🎉</p>
<p><img src="https://media.giphy.com/media/Kzo0heGPi6xwjpC5JL/giphy.gif" alt="Hiring GIF"></p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>And yes indeed, we are still hiring!
<a href="https://www.notion.so/iterative/iterative-ai-is-hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a>
to find details of all the positions including:</p>
<ul>
<li>Senior Front-End Engineer (TypeScript, Node, React)</li>
<li>Senior Software Engineer (ML, Dev Tools, Python)</li>
<li>Senior Software Engineer (ML, Data Infra, GoLang)</li>
<li>Machine Learning Engineer/Field Data Scientist</li>
<li>Developer Advocate (ML)</li>
<li>Director/VP of Engineering (ML, DevTools)</li>
<li>Director/VP of Product (ML, Data Infra, SaaS)</li>
<li>Director/VP of Operations/Chief of Staff</li>
</ul>
<p>Please pass this info on to anyone you know that may fit the bill. We look
forward to new team members! 🎉</p>
<h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Don't miss our <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/" target="_blank" rel="nofollow noopener noreferrer">Meetup</a>
on June 24th at 3:00 pm UTC (8:00 am PDT), where
<a href="https://www.linkedin.com/in/sami-jawhar-a58b9849/" target="_blank" rel="nofollow noopener noreferrer">Sami Jawhar</a> of Kernel will
present different experiment use cases. Bring your questions and thinking cap!
It's bound to be a great session!</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/278729121/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">June DVC Office Hours</h4>
<div class="elp-description">June DVC Office Hours with Sami Jawhar of Kernel presenting experiment use cases.</div>
<div class="elp-link">https://meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-06-18/meetup-a796bdc01514d6fdf2c359ca264f8ef9.png" alt="June DVC Office Hours">
</div>
</a>
</section>
<p></p>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Some people asked about <a href="https://twitter.com/DVCorg">@DVCorg</a> and how I use it so here's am <a href="https://twitter.com/hashtag/Rstats?src=hash&ref_src=twsrc%5Etfw">#Rstats</a> 📦 I'm creatively calling {dvcru} with some utility functions and documentation about how to use the DVC workflow. It will also bootstrap a project with DVC once I push some changes. Check it out!</p>— João Santiago (@jcpsantiago) <a href="https://twitter.com/jcpsantiago/status/1402221732480569349">June 8, 2021</a></blockquote>
<hr>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>
https://dvc.org/blog/introducing-dvc-studio
Wed, 02 Jun 2021 00:00:00 GMT
<p>We are excited to release DVC Studio - the online UI for DVC and CML.</p>
<p><a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a> and <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a> have been widely used by ML
engineers, data scientists and researchers to simplify their Machine Learning
processes. With 8000 GitHub 🌟 and 200+ open source contributors, they have
gained popularity as tools that take advantage of the existing engineering
toolset that you're already familiar with (Git, CI/CD, etc.) to provide you with
best practices for organizing your data and ML projects and collaborating
effectively. DVC Studio, an extension on top of DVC and CML, adds even more
capabilities to your MLOps toolset.</p>
<p>DVC Studio is a big new step for our team. Many of you have rightly pointed out
the <a href="https://github.com/iterative/dvc/issues/1074" target="_blank" rel="nofollow noopener noreferrer">need for a visual UI</a> for
DVC. Your needs,
<a href="https://github.com/iterative/dvc/discussions/5941" target="_blank" rel="nofollow noopener noreferrer">ideas and suggestions</a> are
our priority. And so, we are thrilled that our new product will make your ML
journeys even smoother.</p>
<h2 id="how-does-dvc-studio-work" style="position:relative;">How does DVC Studio work?<a href="#how-does-dvc-studio-work" aria-label="how does dvc studio work permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>DVC Studio is a web application that you can
<a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">access online</a> or even host on-prem. It works
with the data, metrics and hyperparameters that you add to your ML project
repositories.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b517c192e0755ae304bc4427d44d5cc8/39600/dvc-studio-view.png" alt="dvc studio view" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Each experiment,
represented by a commit in your Git history, is presented along with its data,
metrics and hyperparameters. This is your playground for visualizing, comparing
and even running experiments.</em></p>
<p>With DVC Studio we rely on you saving information into your Git repository.
Connect DVC Studio with GitHub, GitLab or Bitbucket to read repositories and to
run new experiments (using regular CI/CD capabilities - we'll talk about this in
a moment).</p>
<p>DVC Studio analyzes Git history and extracts information about your ML
experiments - datasets being used, metrics and hyperparameters. By using DVC,
you can be sure not to bloat your repositories with large volumes of data or
huge models. These large assets reside in cloud or other remote storage
locations (and we don't require you to give us access to them!).</p>
<h2 id="visualize-collaborate-track" style="position:relative;">Visualize. Collaborate. Track.<a href="#visualize-collaborate-track" aria-label="visualize collaborate track permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This video shows you how you can visualize your experiments using DVC Studio.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/hKf4twg832g?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>DVC, along with Git, performs your ML bookkeeping automatically. Using a simple
UI, you can import your experiment history from Git. You can get quick access to
important metrics across multiple projects, or dive deep and explore individual
experiments. You can visualize and compare models the way that best fits your
needs, whether it is through precision-recall curves, scores comparison, or
trend charts showing how your model is evolving over time.</p>
<p>This makes it easy to see exactly how your model’s performance changed when you
increased the number of layers in your neural net, added some more samples to
your training dataset, or increased the number of epochs to run the training
for.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/eab5b427468a7c6ddc2ce8f487243048/39600/trends-chart.png" alt="trends chart" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>With DVC Studio, you can
visualize your model evolution. This Trends chart, for instance, shows how the
average precision increased over the course of your experiments.</em></p>
<p>You will get the dashboard and all the visuals automatically if your metrics and
plots are stored in Git through DVC. But if you do not use DVC, you can still
add custom files with your metrics and parameters and DVC Studio will
efficiently generate tables and plots for your custom input.</p>
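<p>As a rough illustration of such custom files, the sketch below writes a run's parameters and metrics as plain JSON that can be committed to Git. The file names <code>params.json</code> and <code>metrics.json</code>, the helper <code>write_run_files</code>, and the example values are assumptions made for this example, not names that DVC Studio requires.</p>

```python
import json
import tempfile
from pathlib import Path


def write_run_files(repo_dir, params, metrics):
    """Write params and metrics as plain JSON files that can be committed to Git."""
    repo = Path(repo_dir)
    params_path = repo / "params.json"    # hypothetical file name
    metrics_path = repo / "metrics.json"  # hypothetical file name
    # Sorted keys and a trailing newline keep Git diffs small and stable.
    params_path.write_text(json.dumps(params, indent=2, sort_keys=True) + "\n")
    metrics_path.write_text(json.dumps(metrics, indent=2, sort_keys=True) + "\n")
    return params_path, metrics_path


if __name__ == "__main__":
    # Example values are made up for illustration.
    with tempfile.TemporaryDirectory() as tmp:
        params_file, metrics_file = write_run_files(
            tmp, {"epochs": 10, "lr": 0.001}, {"avg_prec": 0.91}
        )
        print(params_file.read_text())
        print(metrics_file.read_text())
```

<p>Committing files like these after each training run gives every Git commit its own record of inputs and results.</p>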
<p>DVC Studio also provides a visual UI to create and manage teams, manage roles, and
share your experiment tables, enabling easy and efficient knowledge sharing and
collaboration.</p>
<h2 id="use-git-for-ml-metrics-tracking-nothing-fancy" style="position:relative;">Use Git for ML metrics tracking. Nothing fancy.<a href="#use-git-for-ml-metrics-tracking-nothing-fancy" aria-label="use git for ml metrics tracking nothing fancy permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Most ML engineers already use Git for code versioning. <a href="https://dvc.org/doc/command-reference/init"><code>dvc init</code></a>, <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>,
<a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> - these simple Git-like DVC commands are all you need to convert your
Git repos into DVC repos - a single source of truth for not just your code but
also your data, model and metrics.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/5xM5az78Lrg?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>What makes DVC Studio special is this connection to the Git ecosystem. The table
and visuals in DVC Studio aren’t magic - they are simply a representation of the
data in JSON or CSV files in your Git repositories.</p>
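<p>To make that idea concrete, here is a minimal sketch of how a table row and a chart series could be derived from such files. It is not DVC Studio's implementation: the helpers <code>metrics_row</code> and <code>plot_points</code>, the commit hashes, and the file contents are all hypothetical.</p>

```python
import csv
import io
import json


def metrics_row(commit, metrics_text):
    """Turn one commit's metrics JSON text into a flat table row."""
    row = {"commit": commit}
    row.update(json.loads(metrics_text))
    return row


def plot_points(csv_text, x, y):
    """Read (x, y) pairs for a chart from CSV text that has a header line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(float(r[x]), float(r[y])) for r in reader]


if __name__ == "__main__":
    # Hypothetical file contents, as they might look at two different commits.
    rows = [
        metrics_row("a1b2c3d", '{"avg_prec": 0.89, "roc_auc": 0.94}'),
        metrics_row("e4f5a6b", '{"avg_prec": 0.91, "roc_auc": 0.95}'),
    ]
    for row in rows:
        print(row)
    print(plot_points("epoch,loss\n0,0.70\n1,0.52\n", "epoch", "loss"))
```

<p>Because the inputs are ordinary text files, the same rows and points can be rebuilt from any commit in the Git history.</p>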
<h2 id="automate-your-ml-process-no-code" style="position:relative;">Automate your ML process. No-code.<a href="#automate-your-ml-process-no-code" aria-label="automate your ml process no code permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Mature ML teams reuse their code over and over again while tuning data and
hyperparameters. DVC Studio automates this in the visual user interface. To run
an experiment on DVC Studio, use its UI to modify the ML model hyperparameters
and dataset version. The modifications and the message you enter will be
automatically converted to a proper Git commit. Your team members can see the
changes through your Git platform or DVC Studio and track the author and
timestamp of the change.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/nXJXR-zBvHQ?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>If your project is integrated with the CI/CD process, the model training process
will be automatically triggered. Once the experiment completes, all its inputs
and outputs are available in DVC Studio, ready for visualizing and comparing.
This visual modification helps your team to iterate faster and avoid mistakes
with manual code changes.</p>
<p><a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML</a> can create reports and orchestrate resources in your
cloud (GCP, AWS or Azure) or Kubernetes to run training. Because this is
cloud-agnostic, you are not tied to a particular cloud provider, and this helps
you avoid vendor lock-in.</p>
<p>With this approach, managers and DevOps folks who are not experts in
creating ML models can also be part of the ML model training process. They can
re-train your model on a new version of the dataset or try other changes to your
model.</p>
<h2 id="create-magic" style="position:relative;">Create magic!<a href="#create-magic" aria-label="create magic permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>So, don’t reinvent the wheel. Use Git. Through a simple UI. Use your existing
CI/CD setup. Use your existing cloud. Get the most out of them. And create magic
:) Okay, the tables and visuals in DVC Studio aren’t magic, but they sure are
magical. Right?</p>
<h2 id="get-started-now" style="position:relative;">Get started now<a href="#get-started-now" aria-label="get started now permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Get started at <a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">https://studio.datachain.ai</a>.
Simply connect with your GitHub, GitLab or Bitbucket account. No additional
sign-ups are required.</p>
<p>For more information on how to use DVC Studio, please check out the
<a href="https://dvc.org/doc/studio" target="_blank" rel="nofollow noopener noreferrer">docs</a>.</p>
<p>DVC Studio is completely free for individuals and small teams. Let us know if
you would like to set up DVC Studio
for <a href="https://form.typeform.com/to/nydf3Oys?typeform-medium=embed-snippet" target="_blank" rel="nofollow noopener noreferrer">5+ member teams</a>
or for
<a href="https://form.typeform.com/to/bd9lTEt9?typeform-medium=embed-snippet" target="_blank" rel="nofollow noopener noreferrer">enterprises</a>,
and we will get back to you soon.</p>
<p>We would love to get your feedback. Reach out to us with your questions,
concerns or requests on <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>. Head to
the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas. You can also
raise an issue on <a href="https://github.com/iterative/studio-support" target="_blank" rel="nofollow noopener noreferrer">GitHub</a>.</p>
<p>We are super excited to have you use DVC Studio. We’re confident that it’ll make
your Machine Learning journeys so much easier. We can’t wait to hear how it
goes.</p>
https://dvc.org/blog/may-21-community-gems
Fri, 28 May 2021 00:00:00 GMT
<p>Each month we go through our Discord messages to pull out some of the best
questions from our community. AKA: Community Gems. 💎 This month we'd like to
thank @asraniel, @PythonF, @mattlbeck, @Ahti, @yikeqicn, @lexzen, @EdAb, and
@FreshLettuce for inspiring this month's gems!</p>
<p>As always, <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">join us in Discord</a> to get all
your DVC and CML questions answered!</p>
<h2 id="dvc" style="position:relative;">DVC<a href="#dvc" aria-label="dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="what-is-the-best-way-to-commit-2-experiment-runs" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/836626346544594995" target="_blank" rel="nofollow noopener noreferrer">What is the best way to commit 2 experiment runs?</a><a href="#what-is-the-best-way-to-commit-2-experiment-runs" aria-label="what is the best way to commit 2 experiment runs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Use <a href="https://dvc.org/doc/command-reference/exp/branch"><code>dvc exp branch</code></a> if you want to keep multiple experiments. That
way, each one is in its own branch, and you avoid applying one experiment
on top of another.</p>
<h3 id="how-can-i-clean-up-the-remote-caches-after-a-lot-of-experiments-and-branches-have-been-pushed" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/831142466169733120" target="_blank" rel="nofollow noopener noreferrer">How can I clean up the remote caches after a lot of experiments and branches have been pushed?</a><a href="#how-can-i-clean-up-the-remote-caches-after-a-lot-of-experiments-and-branches-have-been-pushed" aria-label="how can i clean up the remote caches after a lot of experiments and branches have been pushed permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://dvc.org/doc/command-reference/exp"><code>dvc exp gc</code></a> requires at least one flag to operate, at the very least
<code>--workspace</code>. With <code>--workspace</code>, <code>dvc</code> reads all of the pointer
files (<a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files and <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> files) in the workspace and
determines which cache objects/files need to be preserved because they are
being used in the current workspace. The rest of the files in
<code>.dvc/cache</code> are removed.</p>
<p><em>This does not require any Git operations!</em></p>
<p>You can also use the <code>--all-branches</code> flag. It will read all of the files
present in the current workspace and from the commits in the branches you have
locally. Then it will use that list to determine what to keep and what to
remove.</p>
<p>If you need to read pointer files from given tags you have locally, the
<code>--all-tags</code> flag is the best option.</p>
<p>The <code>--all-commits</code> flag reads pointer files from every commit. It makes
a list of all the files that are in the cache/remote, and deletes any file whose
<em>.dvc</em> file isn't found in any commit of the Git repo.</p>
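<p>To make the keep-or-remove logic behind these flags concrete, here is a toy
model in Python. It is not DVC's implementation: the function name, the scope
labels and the object IDs are all made up for illustration. It only shows that
whatever the selected scopes reference is kept and everything else is removed.</p>

```python
def objects_to_remove(cache_objects, refs_by_scope, scopes):
    """Return the cache objects that no selected scope references.

    cache_objects: IDs of everything currently in the cache (hypothetical).
    refs_by_scope: scope name -> IDs referenced by pointer files in that scope.
    scopes: the scopes selected by the flags, e.g. ["workspace", "branches"].
    """
    keep = set()
    for scope in scopes:
        keep |= set(refs_by_scope.get(scope, ()))
    return set(cache_objects) - keep


# Hypothetical cache contents and references.
cache = {"a1", "b2", "c3", "d4"}
refs = {
    "workspace": {"a1"},   # what --workspace would read
    "branches": {"b2"},    # extra references found via --all-branches
    "tags": {"c3"},        # extra references found via --all-tags
}

# Widening the scope keeps more objects and removes fewer.
print(sorted(objects_to_remove(cache, refs, ["workspace"])))
print(sorted(objects_to_remove(cache, refs, ["workspace", "branches"])))
```

<p>In this model, an object that no scope references (<code>d4</code> here) is
removed no matter which flags are combined.</p>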
<h3 id="if-i-have-two-cloud-folder-links-added-to-the-dvc-config-im-able-to-push-the-data-to-the-default-one-how-could-i-push-the-data-to-the-other-cloud-folder" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/833176227762274364" target="_blank" rel="nofollow noopener noreferrer">If I have two cloud folder links added to the DVC config, I'm able to push the data to the default one. How could I push the data to the other cloud folder?</a><a href="#if-i-have-two-cloud-folder-links-added-to-the-dvc-config-im-able-to-push-the-data-to-the-default-one-how-could-i-push-the-data-to-the-other-cloud-folder" aria-label="if i have two cloud folder links added to the dvc config im able to push the data to the default one how could i push the data to the other cloud folder permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You're looking for the <code>-r / --remote</code> option for <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>. The command looks
like this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span> <span class="token parameter variable">--remote</span> <span class="token operator"><</span>name_of_remote_storage<span class="token operator">></span></span></code></pre></div>
<p>It will push directly to the remote storage you defined in the command above.</p>
<h3 id="whats-the-current-recommended-way-to-automate-hyperparameter-search-when-using-dvc-pipelines" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/829803720190590986" target="_blank" rel="nofollow noopener noreferrer">What's the current recommended way to automate hyperparameter search when using DVC pipelines?</a><a href="#whats-the-current-recommended-way-to-automate-hyperparameter-search-when-using-dvc-pipelines" aria-label="whats the current recommended way to automate hyperparameter search when using dvc pipelines permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Take a look at the new
<a href="https://dvc.org/doc/start/experiments" target="_blank" rel="nofollow noopener noreferrer">experiments feature</a>! It enables you to
easily experiment with different parameter values.</p>
<p>You could script a grid search pretty easily by queueing an experiment for each
set of parameter values you want to try. For example:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--queue</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">alpha</span><span class="token operator">=</span><span class="token punctuation">{</span>alpha<span class="token punctuation">}</span>,beta<span class="token operator">=</span><span class="token punctuation">{</span>beta<span class="token punctuation">}</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--run-all</span> <span class="token parameter variable">--jobs</span> <span class="token number">2</span></span></code></pre></div>
<p>The <code>--jobs 2</code> flag means you're running 2 queued experiments in parallel. By
default, the <code>--run-all</code> flag runs 1 queued experiment at a time.</p>
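<p>As a rough sketch of how such a grid search could be scripted, the Python
snippet below builds one <code>dvc exp run --queue</code> command per parameter
combination and prints the commands instead of running them. The parameter
names <code>alpha</code> and <code>beta</code>, their values, and the helper
<code>build_queue_commands</code> are assumptions for illustration. It also
assumes that <code>-S</code> is passed once per parameter, so check the form
your DVC version expects.</p>

```python
import itertools
import shlex


def build_queue_commands(grid):
    """Build one `dvc exp run --queue` argv list per parameter combination."""
    names = sorted(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        cmd = ["dvc", "exp", "run", "--queue"]
        for name, value in zip(names, values):
            # Assumption: one -S flag per parameter.
            cmd += ["-S", f"{name}={value}"]
        commands.append(cmd)
    return commands


# Hypothetical parameter grid.
grid = {"alpha": [0.1, 0.5], "beta": [1, 2]}
commands = build_queue_commands(grid)

# Dry run: print the commands rather than executing them.
for cmd in commands:
    print(shlex.join(cmd))
print(shlex.join(["dvc", "exp", "run", "--run-all", "--jobs", "2"]))
```

<p>To actually queue the experiments, each list could be passed to
<code>subprocess.run(cmd, check=True)</code> from inside the DVC project.</p>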
<p>Then you can compare the results with <a href="https://dvc.org/doc/command-reference/exp/show"><code>dvc exp show</code></a>.</p>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ───────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span>
</span> ───────────────────────────────────────────────────────────────────
<span class="token rows"> workspace 0.56191 0.93345 50 2
master 0.55259 0.91536 50 2
├── exp-bfe64 0.57833 0.95555 50 8
├── exp-b8082 0.59806 0.95287 50 64
├── exp-c7250 0.58876 0.94524 100 2
├── exp-b9cd4 0.57953 0.95732 100 8
├── exp-98a96 0.60405 0.9608 100 64
└── exp-ad5b1 0.56191 0.93345 50 2
</span> ───────────────────────────────────────────────────────────────────</code></pre></div>
<p>We are working on giving experiments features or documented patterns that
explicitly support grid search, so please
<a href="https://github.com/iterative/dvc/issues/4283" target="_blank" rel="nofollow noopener noreferrer">share any feedback</a> to help drive
the future direction of that!</p>
<h3 id="when-importinggetting-data-from-a-repo-how-do-i-provide-credentials-to-the-source-repo-remote-storage-without-saving-it-into-that-git-repo" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/830021022337073185" target="_blank" rel="nofollow noopener noreferrer">When importing/getting data from a repo, how do I provide credentials to the source repo remote storage without saving it into that Git repo?</a><a href="#when-importinggetting-data-from-a-repo-how-do-i-provide-credentials-to-the-source-repo-remote-storage-without-saving-it-into-that-git-repo" aria-label="when importinggetting data from a repo how do i provide credentials to the source repo remote storage without saving it into that git repo permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There's a bit of context behind this question that might give it more meaning.
Here's the background information given by @EdAb in Discord:</p>
<hr>
<p>I set up a private GitHub repo to be a data registry and I have set up a private
Azure remote where I have pushed some datasets.</p>
<p>I am now trying to read those datasets from another repository
("my-project-repo"), using <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a> (e.g.
<a href="https://dvc.org/doc/command-reference/get#-registry-repo"><code>dvc get [email protected]:data-registry-repo.git path/data.csv</code></a>) but I get this
error:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">ERROR: failed to get <span class="token string">'path/data.csv'</span> from <span class="token string">'[email protected]:data-registry-repo.git'</span> - Authentication to Azure Blob Storage via default credentials <span class="token punctuation">(</span>https://azuresdkdocs.blob.core.windows.net/<span class="token variable">$web</span>/python/azure-identity/1.4.0/azure.identity.html<span class="token comment">#azure.identity.DefaultAzureCredential) failed.</span>
Learn <span class="token function">more</span> about configuration settings at <span class="token operator"><</span>https://man.dvc.org/remote/modify<span class="token operator">></span>: unable to connect to account <span class="token keyword">for</span> Must provide either a connection_string or account_name with credentials<span class="token operator">!</span><span class="token operator">!</span></code></pre></div>
<hr>
<p>Generally, there are two ways to solve this issue:</p>
<ul>
<li><a href="https://dvc.org/doc/command-reference/remote/modify" target="_blank" rel="nofollow noopener noreferrer">ENV vars</a></li>
<li>Set up some options using the <code>--global</code> or <code>--system</code> flags to update the DVC
config</li>
</ul>
<p>If you're going to update the DVC config to include your cloud credentials, use
the <a href="https://dvc.org/doc/command-reference/remote/modify"><code>dvc remote modify</code></a> command. Here's an example of how you can do that with
Azure using the <code>--global</code> flag.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> <span class="token parameter variable">--global</span> myremote connection_string <span class="token string">'mysecret'</span></span></code></pre></div>
<p>You should initialize <code>myremote</code> in the config file with <a href="https://dvc.org/doc/command-reference/remote/add"><code>dvc remote add</code></a> and
remove the URL to rely on the one that comes from the repo being imported.</p>
<p>This will modify the global config file, instead of the <em>.dvc/config</em> file. You
could also use the <code>--system</code> flag to modify the system file if that's necessary
for your project. You can take a look at the specific
<a href="https://dvc.org/doc/command-reference/config" target="_blank" rel="nofollow noopener noreferrer">config file locations here</a>.</p>
<h3 id="is-there-any-way-to-ensure-that-dvc-import-uses-the-cache-from-the-config-file-and-how-can-i-keep-the-cache-consistent-for-multiple-team-members" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/827574712825413672" target="_blank" rel="nofollow noopener noreferrer">Is there any way to ensure that <code>dvc import</code> uses the cache from the config file and how can I keep the cache consistent for multiple team members?</a><a href="#is-there-any-way-to-ensure-that-dvc-import-uses-the-cache-from-the-config-file-and-how-can-i-keep-the-cache-consistent-for-multiple-team-members" aria-label="is there any way to ensure that dvc import uses the cache from the config file and how can i keep the cache consistent for multiple team members permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is another great question where a little context might be useful.</p>
<hr>
<p>I'm trying to import a dataset project called <em>dvcdata</em> into another DVC
project.</p>
<p>The config for <em>dvcdata</em> is:</p>
<div class="gatsby-highlight" data-language="ini"><pre class="language-ini"><code class="language-ini"><span class="token section"><span class="token punctuation">[</span><span class="token section-name selector">core</span><span class="token punctuation">]</span></span>
<span class="token key attr-name">remote</span> <span class="token punctuation">=</span> <span class="token value attr-value">awsremote</span>
<span class="token section"><span class="token punctuation">[</span><span class="token section-name selector">cache</span><span class="token punctuation">]</span></span>
<span class="token key attr-name">type</span> <span class="token punctuation">=</span> <span class="token value attr-value">symlink</span>
<span class="token key attr-name">dir</span> <span class="token punctuation">=</span> <span class="token value attr-value">/home/user/dvc_cache</span>
<span class="token section"><span class="token punctuation">[</span><span class="token section-name selector">'remote "awsremote"'</span><span class="token punctuation">]</span></span>
<span class="token key attr-name">url</span> <span class="token punctuation">=</span> <span class="token value attr-value">s3://...</span></code></pre></div>
<p>When I run <a href="https://dvc.org/doc/command-reference/import"><code>dvc import [email protected]:user/dvcdata.git my_data</code></a>, it starts to
download it. I have double checked that I have pushed this config file to master
and don't understand why it's not pulling the data from my cache instead of
downloading the data again.</p>
<hr>
<p>The repo you are importing into has its own cache directory. If you want to use
the same cache directory across both projects, you have to configure <em>cache.dir</em>
in both projects. You also have the option to configure the <em>cache.type</em>.</p>
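<p>One quick way to see whether two projects point at the same cache is to
compare the <code>[cache]</code> section of their config files. The sketch
below does that with Python's standard <code>configparser</code>. That is an
assumption that only holds for simple configs like the one above: DVC parses
its config itself and also merges global and system config files, which this
check ignores. The second config text is a hypothetical importing project.</p>

```python
import configparser

# Config of the dataset project, as shown above.
DVCDATA_CONFIG = """
[core]
remote = awsremote
[cache]
type = symlink
dir = /home/user/dvc_cache
['remote "awsremote"']
url = s3://...
"""

# Hypothetical config of the importing project, with no [cache] section.
OTHER_CONFIG = """
[core]
remote = awsremote
"""


def cache_settings(config_text):
    """Return (cache.dir, cache.type) from a config text, or None where unset."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if not parser.has_section("cache"):
        return None, None
    return (
        parser.get("cache", "dir", fallback=None),
        parser.get("cache", "type", fallback=None),
    )


source = cache_settings(DVCDATA_CONFIG)
target = cache_settings(OTHER_CONFIG)
same_cache = source[0] is not None and source[0] == target[0]
print("source:", source)
print("target:", target)
print("same cache dir:", same_cache)
```

<p>Here the importing project has no <code>cache.dir</code> of its own, so it
would fall back to its own cache directory and the data would be downloaded
again.</p>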
<p>You can set up the cache dir and cache link type in your own global config and
then when project 1 imports <code>dvcdata</code>, it will be cached there. Finally when
project 2 imports <code>dvcdata</code>, it will just be linked or copied, depending on the
config, from the cache without downloading.</p>
<p>We recommend you use the <code>--global</code> or <code>--system</code> flags in the <a href="https://dvc.org/doc/command-reference/config"><code>dvc config</code></a>
command for updating the configs globally. An example of this would be:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc config</span> <span class="token parameter variable">--global</span> cache.dir path/to/cache/</span></code></pre></div>
<p>If you set up a cache that is not shared and is located on a separate volume,
and you have a lot of data, consider also enabling symlinks as described in
<a href="https://dvc.org/doc/user-guide/large-dataset-optimization#large-dataset-optimization" target="_blank" rel="nofollow noopener noreferrer">Large Data Optimizations</a>.</p>
<p>You might also consider using the local URL of the source project to avoid the
import downloading from the remote storage. That would look something like this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc import</span> /home/user/dvcdata my_data</span></code></pre></div>
<p>If your concern is keeping these configs consistent for multiple users on the
same machine, check out
<a href="https://dvc.org/doc/use-cases/fast-data-caching-hub#example-shared-development-server" target="_blank" rel="nofollow noopener noreferrer">the doc on shared server development</a>
to get more details!</p>
<h2 id="cml" style="position:relative;">CML<a href="#cml" aria-label="cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://discord.com/channels/485586884165107732/728693131557732403/827099289372983336" target="_blank" rel="nofollow noopener noreferrer">I have an ML model that retrains every 24 hours with updated data, but I do not want to create a merge request every time. I just need a nice way to look at the results. Is there a solution on how to report the results of a pipeline in Gitlab?</a></p>
<p>Great question! CML doesn't currently have a feature that takes care of this,
but here are a couple of solutions (only one is needed):</p>
<ol>
<li>Keep a separate branch with unrelated history for committing the reports.</li>
<li>Keep a single report file on the repository and update it with each commit.</li>
</ol>
<p><a href="https://discord.com/channels/485586884165107732/728693131557732403/818450988084101160" target="_blank" rel="nofollow noopener noreferrer">I've run into an error trying to get CML to orchestrate runs in my AWS account. It doesn't seem to be a permissions issue as the <code>AWSEc2FullAccess</code> policy seems to have worked, but I can't see the security group. What could be going wrong?</a></p>
<p>Check to make sure you are deploying to the correct region. Use the argument
<code>--cloud-region &lt;region&gt;</code> (<code>us-west</code> for example) to mark the region where the
instance is deployed.</p>
<p><a href="https://discord.com/channels/485586884165107732/728693131557732403/818450988084101160">Head to these docs</a>
for more information on the optional arguments that the CML runner accepts.</p>
<p>Until next month…</p>
<p><img src="https://media.giphy.com/media/XcAa52ejGuNqdb5SFQ/giphy.gif" alt="You Got This Hedgehog GIF by MOODMAN"></p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and
CML questions answered and contribute to the MLOps community! 🚀</p>https://dvc.org/blog/may-21-dvc-heartbeathttps://dvc.org/blog/may-21-dvc-heartbeatFri, 21 May 2021 00:00:00 GMT<h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>It's been another month full of community goodness and we are grateful! Let's
get right to it!</p>
<p><img src="https://media.giphy.com/media/jmqWAjoxFCxJNHD2Kz/giphy.gif" alt="Thank you"></p>
<h3 id="curvenote-with-dvc-tutorials" style="position:relative;">Curvenote with DVC tutorials<a href="#curvenote-with-dvc-tutorials" aria-label="curvenote with dvc tutorials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Interested in versioning your data AND your notebooks?<br>
<a href="https://twitter.com/stevejpurves" target="_blank" rel="nofollow noopener noreferrer">Steve Purves</a>, CTO and co-founder of
<a href="https://curvenote.com/" target="_blank" rel="nofollow noopener noreferrer">Curvenote</a>, gave a three-part tutorial on integrating
DVC and Curvenote for creating reproducible, collaborative version control for
data scientists. The videos are beginner-accessible with tips for intermediate
Git users.
<a href="https://www.youtube.com/watch?v=OnNVbIEIO7A" target="_blank" rel="nofollow noopener noreferrer">Access the videos here.</a></p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/3a00f0b45348e1b6f411aed445cf2c8e/03346/curvenote-dvc-integration.jpg" alt="curvenote dvc integration" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>DVC and
Curvenote for the version control win!</em></p>
<h3 id="cml-with-jenkins-in-dagshub" style="position:relative;">CML with Jenkins in DAGsHub<a href="#cml-with-jenkins-in-dagshub" aria-label="cml with jenkins in dagshub permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Next up, <a href="https://www.linkedin.com/in/puneeth-pai-b3b299a1/" target="_blank" rel="nofollow noopener noreferrer">Puneeth Pai</a> of
<a href="https://www.thoughtworks.com/" target="_blank" rel="nofollow noopener noreferrer">Thoughtworks</a> wrote a two-part blog series with
a how-to for achieving continuous machine learning using DVC pipelines with
Jenkins and DAGsHub. Quoted in the article is our own
<a href="https://github.com/DavidGOrtega" target="_blank" rel="nofollow noopener noreferrer">David Ortega</a>,</p>
<blockquote>
<p>Treating experiments like potential new features in a software project opens
up many possibilities for improving our engineering practices.</p>
</blockquote>
<p>Check out these posts at the link below or catch Puneeth at our next
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/278163666/" target="_blank" rel="nofollow noopener noreferrer">Meetup</a>
where he will be giving a high level overview of this content as well as
answering questions.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://dagshub.com/blog/in-depth-tour-of-jenkinsfile/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">CML with Jenkins in DAGsHub</h4>
<div class="elp-description">The first of a two-part series on how to set up continuous machine learning using DVC pipelines with Jenkins and DAGsHub.</div>
<div class="elp-link">https://dagshub.com/</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-05-21/puneeth-gears-a85728868a27748e24f94f8e73f46032.png" alt="CML with Jenkins in DAGsHub">
</div>
</a>
</section>
<p></p>
<h3 id="discord-server-explosion" style="position:relative;">Discord Server Explosion<a href="#discord-server-explosion" aria-label="discord server explosion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Our <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord server</a> has exploded since last
month, up 30% in membership 😱, thanks in large part to a
<a href="https://towardsdatascience.com/" target="_blank" rel="nofollow noopener noreferrer"><strong>Towards Data Science</strong></a> post from
<a href="https://www.linkedin.com/in/sara-a-metwalli/" target="_blank" rel="nofollow noopener noreferrer">Sara Metwalli</a> recommending
<a href="https://towardsdatascience.com/9-discord-servers-for-math-python-and-data-science-you-need-to-join-today-34214b93d6b8" target="_blank" rel="nofollow noopener noreferrer"><strong>9 Discord Servers for Math, Python, and Data Science You Need to Join Today.</strong></a></p>
<p>Sara encourages readers to connect, learn and get inspired. 🚀 Thanks Sara!
We're on board with that! Rest assured our growing team is hard at work creating
content, improving tools and working on new tools 😶🤗 to continue to grow and
serve our MLOps community!</p>
<h1 id="in-other-mlops-news-" style="position:relative;">In Other MLOps News …<a href="#in-other-mlops-news-" aria-label="in other mlops news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<h2 id="learning-opportunities" style="position:relative;">Learning Opportunities<a href="#learning-opportunities" aria-label="learning opportunities permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://twitter.com/AndrewYNg" target="_blank" rel="nofollow noopener noreferrer">Andrew Ng</a> of
<a href="https://twitter.com/DeepLearningAI_" target="_blank" rel="nofollow noopener noreferrer">DeepLearning.AI</a> and
<a href="https://www.coursera.org/" target="_blank" rel="nofollow noopener noreferrer">Coursera</a> fame has just released a new course
specializing in MLOps, called
<a href="https://www.coursera.org/specializations/machine-learning-engineering-for-production-mlops?utm_campaign=20210423-mlep-1-program-email-mlep-launch&utm_medium=institutions&_hsmi=126760441&_hsenc=p2ANqtz-9wSUanrnpyWNavtaCEzBLVpDXwatEig_ahaksJQhZO6dKkLRykfOxRwkpAZiipxWej4xs1uQgrXl-JCgB0M-Ha_vCUvEqaswIVZQhNd-jUDsE8SJs&utm_source=deeplearning-ai" target="_blank" rel="nofollow noopener noreferrer">Machine Learning Engineering for Production (MLOps) Specialization</a>.
The course "combines the foundational concepts of machine learning with the
functional expertise of modern software development and engineering roles."
Methodologies and capabilities of MLOps are introduced while addressing the
challenges and consequences of machine learning engineering in production. I'm
signed up! 🙋🏻♀️ How 'bout you?</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.coursera.org/specializations/machine-learning-engineering-for-production-mlops?utm_campaign=20210423-mlep-1-program-email-mlep-launch&utm_medium=institutions&_hsmi=126760441&_hsenc=p2ANqtz-9wSUanrnpyWNavtaCEzBLVpDXwatEig_ahaksJQhZO6dKkLRykfOxRwkpAZiipxWej4xs1uQgrXl-JCgB0M-Ha_vCUvEqaswIVZQhNd-jUDsE8SJs&utm_source=deeplearning-ai" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Machine Learning Engineering for Production (MLOps) Specialization</h4>
<div class="elp-description">Andrew Ng's new course on Coursera, providing the foundation for successful and efficient MLOps</div>
<div class="elp-link">https://www.coursera.org/</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-05-21/andrew-ng-626e2a8303ad876772f9bb809c95cf54.png" alt="Machine Learning Engineering for Production (MLOps) Specialization">
</div>
</a>
</section>
<p></p>
<p>Next for your learning pleasure,
<a href="https://twitter.com/s_scardapane" target="_blank" rel="nofollow noopener noreferrer">Simone Scardapane</a> is in the process of
fulfilling a "personal challenge" to create a PhD course for
<a href="https://twitter.com/s_scardapane/status/1389240445788643329?s=20" target="_blank" rel="nofollow noopener noreferrer"><strong>Reproducible Deep Learning</strong></a>
that includes the use of open source tools including our own DVC!
<a href="https://github.com/sscardapane/reprodl2021" target="_blank" rel="nofollow noopener noreferrer">Head to the link</a> to star the repo
and cheer him on. We will be! 🙌🏼</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 603px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ab974d16e4e3484ec253af5b5feba427/39600/reproducedl.png" alt="reproducedl" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Reproducible Deep Learning PhD
Course</em></p>
<p><a href="https://twitter.com/s_scardapane" target="_blank" rel="nofollow noopener noreferrer">Simone Scardapane</a> is in the process of
fulfilling a "personal challenge" to create a PhD course for
<a href="https://twitter.com/s_scardapane/status/1389240445788643329?s=20" target="_blank" rel="nofollow noopener noreferrer"><strong>Reproducible Deep Learning</strong></a>
that includes the use of open source tools including our own DVC!
<a href="https://github.com/sscardapane/reprodl2021" target="_blank" rel="nofollow noopener noreferrer">Head to the link</a> to star the repo
and cheer him on. We will be! 🙌🏼</p>
<p>You see what I did there, right? <strong>Reproducible</strong>… <strong>Deep Learning</strong>…<br>
Get it? Layers of wit, people. I learned from the best! Just wanted to make sure
you were paying attention!</p>
<p><img src="https://media.giphy.com/media/6ra84Uso2hoir3YCgb/giphy.gif" alt="Marvel Studios Smile GIF by Disney+"></p>
<h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>We've hit 30 team members! Our team is distributed all over the world and has
grown so much that we now have two all-hands meetings! Affectionately called
UTC + and UTC -, these meetings honor all our different time zones while
allowing the other group to watch via recording when they are awake! You know
we're all about solving complicated problems. 💪🏼</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4773d87f73f3147561c9517e44c3ce7a/39600/team-map.png" alt="team map" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Our team</em></p>
<h2 id="new-team-members" style="position:relative;">New Team Members<a href="#new-team-members" aria-label="new team members permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/svetlana-sachkovskaya/" target="_blank" rel="nofollow noopener noreferrer">Svetlana Sachkovskaya</a> is
originally from Belarus, but is currently living in Poland. She has been a full
stack developer for over seven years. She loves traveling, meeting new people
and is excited to work on open source software. In her spare time you may find
her dancing the tango! 💃🏻 Welcome Sveta!</p>
<p>Exemplifying our diverse team in one fell swoop,
<a href="https://cdcl.ml" target="_blank" rel="nofollow noopener noreferrer">Casper da Costa-Luis</a> has lived on three continents. He has
been working on DVC for a couple of years and is a long-standing contributor to
open source. He now joins us on the CML & Docs teams after completing his PhD in
Medical Imaging. Fun facts about Casper include his becoming the U18 chess
champion of Kenya when he was 14 and being a qualified SCUBA diver. 🤿</p>
<p><a href="https://github.com/iesahin" target="_blank" rel="nofollow noopener noreferrer">Emre Şahin</a> joins us on the DVC team as a technical
writer/ML enthusiast/AI dreamer/tutorial builder from Istanbul, Turkey. A
self-described zealot for technologies, Emre has worked on many development/ML-related
projects and has been programming in Python since v. 1.7. We are excited
for Emre to bring you excellent technical content! ✍🏼</p>
<p><a href="https://www.linkedin.com/in/tapa-dipti-sitaula/" target="_blank" rel="nofollow noopener noreferrer">Tapa Dipti Sitaula</a> joins us
as a Senior Product Engineer from Nepal. She previously worked as a Principal
Engineer at a tech startup in India and has worked in various capacities in her
career from engineering to project management and communications. Her interests
include learning languages and breaking gender stereotypes. We're right there
with you Tapa! 🚀</p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>And we're still hiring!</p>
<p><a href="https://weworkremotely.com/company/iterative" target="_blank" rel="nofollow noopener noreferrer"><strong>Check out our three open roles</strong></a>
for:</p>
<ul>
<li><a href="https://weworkremotely.com/remote-jobs/iterative-senior-front-end-engineer" target="_blank" rel="nofollow noopener noreferrer"><strong>Senior Frontend Engineer</strong></a></li>
<li><a href="https://weworkremotely.com/remote-jobs/iterative-senior-software-engineer-open-source-dev-tools-3" target="_blank" rel="nofollow noopener noreferrer"><strong>Senior Software Engineer - Open Source, Dev Tools</strong></a>
and</li>
<li><a href="https://weworkremotely.com/remote-jobs/iterative-developer-advocate" target="_blank" rel="nofollow noopener noreferrer"><strong>Developer Advocate</strong>.</a></li>
</ul>
<p>Does this sound like you or someone you know? Be in touch!</p>
<h2 id="dvcteam-conference-talks" style="position:relative;">DVCTeam Conference Talks<a href="#dvcteam-conference-talks" aria-label="dvcteam conference talks permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://mlrepa.com/" target="_blank" rel="nofollow noopener noreferrer">ML Repa Week</a> took place last month and team members gave
three great talks. <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> gave a talk
on data versioning and machine learning experiments on top of Git.
<a href="https://www.linkedin.com/in/drelleobrien/" target="_blank" rel="nofollow noopener noreferrer">Elle O'Brien</a> gave a talk on
automating machine learning with GitHub Actions and GitLab CI. And finally,
<a href="https://www.linkedin.com/in/mnrozhkov/" target="_blank" rel="nofollow noopener noreferrer">Mikhail Rozhkov</a> gave a talk on setting
up the workflow for machine learning batch scoring applications using DVC,
MLflow and Airflow. Be sure to check out all three talks and other great talks
from the week-long conference.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.youtube.com/watch?v=OD2KiIOMeMw&list=PLlxErbAvYYLDRP6cHtVP76f2g5Yoh6c5R&index=2" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">DVC: Data Versioning and ML Experiments on Top of Git</h4>
<div class="elp-description">Dmitry Petrov's talk at ML Repa Week on using DVC as an extension of Git for data versioning and machine learning experiments</div>
<div class="elp-link">http://ml-repa.ru/en/</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-05-21/dmitry-ml-repa-week-fb9ce5758f68b866a4999eefbd7862ad.png" alt="DVC: Data Versioning and ML Experiments on Top of Git">
</div>
</a>
</section>
<p></p>
<p>
</p><section class="elp-content-holder">
<a href="https://youtu.be/tOo98CtiDJg" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Automating Machine Learning with GitHub Actions & GitLab CI</h4>
<div class="elp-description">Elle O'Brien's conference talk about how to use GitHub Actions or GitLab CI to provide automation for your machine learning projects</div>
<div class="elp-link">http://ml-repa.ru/en</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-05-21/elle-ml-repa-week-a7b3d9a03df85303a074882054faec32.png" alt="Automating Machine Learning with GitHub Actions & GitLab CI">
</div>
</a>
</section>
<p></p>
<p>
</p><section class="elp-content-holder">
<a href="https://youtu.be/PYzvLc7o7u0" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Workflow & MLOps for Batch Scoring Applications with DVC, MLflow and Airflow</h4>
<div class="elp-description">Mikhail Rozhkov's talk on how to set up a workflow for batch scoring applications integrating DVC, MLflow and Airflow</div>
<div class="elp-link">http://ml-repa.ru/en</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-05-21/mikhail-ml-repa-week-66dee03e41e7efb2033190ee966e4bc4.png" alt="Workflow & MLOps for Batch Scoring Applications with DVC, MLflow and Airflow">
</div>
</a>
</section>
<p></p>
<h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Don't miss our
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/277245660" target="_blank" rel="nofollow noopener noreferrer">Meetup</a>
May 27th at 3:00pm UTC, where we will hear from Puneeth Pai as mentioned above
👆🏽, as well as another user putting DVC and CML into action on his team, and
finally from David Ortega discussing CML pull requests! Bring your questions!
We're here to help!</p>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">🦉 I'm really enjoying reading through <a href="https://twitter.com/DVCorg">@DVCorg</a>'s documentation and use cases for operationalizing machine learning models.<a href="https://t.co/9k8tSfXbMK">https://t.co/9k8tSfXbMK</a><br><br>If you've ever tried to put a model in production, these concepts will resonate. Check out their open-source project on <a href="https://twitter.com/github">@Github</a>! ✨ <a href="https://t.co/zsSdlivwZk">pic.twitter.com/zsSdlivwZk</a></p>— 👩💻 Paige Bailey (@DynamicWebPaige) <a href="https://twitter.com/DynamicWebPaige/status/1394389238750326787">May 17, 2021</a></blockquote>
<p>That's quite a shout out! Thanks to
<a href="https://twitter.com/JorgeOrpinel" target="_blank" rel="nofollow noopener noreferrer">Jorge Orpinel</a> and team for always raising
the bar on our docs! Until next month! 👩🏽‍💻</p>
<hr>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/experiment-refshttps://dvc.org/blog/experiment-refsMon, 19 Apr 2021 00:00:00 GMT<p>One of the main features provided by DVC is the ability to version machine
learning (ML) pipelines and experiments using Git commits. While this works very
well for versioning mature projects and models, for projects under active
development that may require generating hundreds of experiments or more in a
single day, typical Git workflows can be difficult to work with. This type of
rapid experimentation may appear to fit nicely with the concept of Git feature
branches, but a Git repository with such large numbers of branches will
eventually become too unwieldy to manage.</p>
<p>In DVC 2.0, we’ve introduced a new feature set aimed at simplifying the
versioning of lightweight ML experiments. DVC now provides a series of <a href="https://dvc.org/doc/command-reference/exp"><code>dvc exp</code></a>
commands which allow you to easily generate new experiments with modified
hyperparameters, and to quickly compare their results. In this post, we’ll show
how DVC leverages the power of Git references to track each experiment, while
also completely abstracting away the need for you to manually manage a
potentially unlimited number of Git feature branches or tags.</p>
<p><em>Note: This post mainly focuses on the “How?” side of DVC 2.0 experiments. For a
great overview of the “What?” check out our
<a href="https://dvc.org/blog/dvc-2-0-release">2.0 release post</a> and our
<a href="https://dvc.org/doc/start/experiments" target="_blank" rel="nofollow noopener noreferrer">Get Started: Experiments</a> guide.</em></p>
<h2 id="experiments-in-dvc-20" style="position:relative;">Experiments in DVC 2.0<a href="#experiments-in-dvc-20" aria-label="experiments in dvc 20 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>At the heart of the new experiments feature is the <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> command.
Whenever a pipeline is executed with <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a>, the results will be
automatically tracked by DVC as a single “experiment”. DVC will track everything
in your workspace as a part of the experiment, including unstaged changes made
prior to execution. This means that DVC experiments can be used to test the
result of changes to DVC-tracked data or pipeline parameters, as well as changes
to Git-tracked code.</p>
<p><img src="https://dvc.org/2021-04-19/exp-run-0e62e88195f222135b89806a7e74915d.gif" alt="Example experiment run" title="Example experiment run"></p>
<p><em>Note: You can follow along with the commands used in this example and
throughout this post, using our
<a href="https://github.com/iterative/example-get-started" target="_blank" rel="nofollow noopener noreferrer">example-get-started</a>
repository.</em></p>
<p>Now let’s take a deeper look into what actually happened when we ran our
experiment. Starting from the latest commit in our repository’s <code>master</code> branch,
we invoked <a href="https://dvc.org/doc/command-reference/exp/run#--set-param"><code>dvc exp run --set-param</code></a> to generate a new experiment with the
specified parameter value. DVC then reproduced our pipeline as if we had
manually edited our <code>params.yaml</code> to contain that parameter change (setting
<code>featurize.max_features</code> to <code>2000</code>), and then saved the results in a new
experiment named <code>exp-26220</code>.</p>
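<p>For reference, the run described above corresponds to a command along the following lines. The parameter name and value are the ones from this example; the generated experiment name will most likely differ on your machine:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"># Run the pipeline as an experiment with one parameter overridden.
$ dvc exp run --set-param featurize.max_features=2000
# List the experiments derived from the current commit, with params and metrics.
$ dvc exp show</code></pre></div>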
<p>Returning DVC users will likely be familiar with the typical Git+DVC workflow of
reproducing your pipeline, staging the results in Git, and then Git committing
those changes:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span>
</span><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> <span class="token builtin class-name">.</span>
</span><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span></span></code></pre></div>
<p>This workflow is now essentially automated within our single <code>exp run</code> command,
with one key difference. Rather than saving the results in a Git <em>branch</em>, the
results are saved in a custom Git <em>reference</em>.</p>
<h2 id="what-is-a-git-reference" style="position:relative;">What is a Git reference?<a href="#what-is-a-git-reference" aria-label="what is a git reference permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>A Git reference (or ref) is a named reference to a Git commit. References are
addressed via a pathname starting with <code>refs/</code>. Git branches and tags are
actually just references which are stored in the <code>refs/heads</code> and <code>refs/tags</code>
namespaces respectively. In our repo, we can see that:</p>
<p>The tip of our <code>master</code> branch is commit <code>f137703</code>:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> show master
</span>commit f137703af59ba1b80e77505a762335805d05d212 (HEAD -> master)
Author: dberenbaum <[email protected]>
Date: Wed Apr 14 14:31:54 2021 -0400
Run experiments tuning random forest params</code></pre></div>
<p><code>master</code> itself is a Git ref (<code>refs/heads/master</code>) pointing to that commit:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> show-ref master
</span>f137703af59ba1b80e77505a762335805d05d212 refs/heads/master</code></pre></div>
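<p>To make the idea of ref namespaces concrete, here is a minimal sketch that creates a ref outside <code>refs/heads</code> and <code>refs/tags</code> in a throwaway repository. The <code>refs/demo</code> namespace and the <code>my-ref</code> name are made up for illustration; they are not something Git or DVC defines:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">set -eu

# Work in a throwaway repository so no real project is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Create a ref in a hypothetical custom namespace, pointing at HEAD.
git update-ref refs/demo/my-ref HEAD

# Listing every ref shows the branch ref and the custom ref.
git for-each-ref --format='%(refname)'

# The custom ref is a real ref, but it is not listed as a branch.
custom=$(git for-each-ref --format='%(refname)' refs/demo)
branches=$(git branch --list --format='%(refname)')
echo "custom refs: $custom"
echo "branches:    $branches"</code></pre></div>
<p>The full listing should contain one ref under <code>refs/heads</code> (its name depends on your default branch setting) and <code>refs/demo/my-ref</code>, while the branch listing contains only the former.</p>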
<h2 id="what-exactly-is-a-dvc-experiment" style="position:relative;">What exactly is a DVC experiment?<a href="#what-exactly-is-a-dvc-experiment" aria-label="what exactly is a dvc experiment permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Now, going back to our experiment run, we see that DVC has generated and saved
an experiment named <code>exp-26220</code>. We can even use that name freely within DVC
commands as if it were a Git branch or tag name:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc metrics diff</span> master exp-26220
</span>Path Metric Old New Change
scores.json avg_prec 0.60405 0.58589 -0.01817
scores.json roc_auc 0.9608 0.945 -0.01581
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc diff</span> master exp-26220
</span>Modified:
data/features/
data/features/test.pkl
data/features/train.pkl
model.pkl
prc.json
roc.json
scores.json
files summary: 0 added, 0 deleted, 0 renamed, 6 modified</code></pre></div>
<p>However, Git tells us that there is no branch or tag named <code>exp-26220</code>, and we
cannot use that name in Git porcelain commands:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git tag</span> <span class="token parameter variable">-l</span>
</span>0-git-init
1-dvc-init
10-bigrams-experiment
11-random-forest-experiments
2-track-data
3-config-remote
4-import-data
5-source-code
6-prepare-stage
7-ml-pipeline
8-evaluation
9-bigrams-model
baseline-experiment
bigrams-experiment
random-forest-experiments
<span class="token line"><span class="token input">$ </span><span class="token command">git</span> branch <span class="token parameter variable">-l</span>
</span>* master
<span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> exp-26220
</span>error: pathspec 'exp-26220' did not match any file(s) known to git</code></pre></div>
<p><em>Note: The Git CLI is divided into
<a href="https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain" target="_blank" rel="nofollow noopener noreferrer">two sets of commands</a>:
the commonly used user-friendly “porcelain” commands (like <code>git checkout</code>) and
the lower level “plumbing” commands.</em></p>
<p>This naturally raises the question, “What is <code>exp-26220</code>?”</p>
<p>The answer is simple. It’s a custom DVC Git ref pointing to a Git commit:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> show-ref exp-26220
</span>c42f48168830148b946f6a75d1bdbb25cda46f35 refs/exps/f1/37703af59ba1b80e77505a762335805d05d212/exp-26220</code></pre></div>
<p><em>Note that <a href="https://dvc.org/doc/command-reference/exp/show#--sha"><code>dvc exp show --sha</code></a> can be used to view Git commit SHAs for
experiments. Using DVC experiments should never require you to use any of the
low-level Git plumbing commands like <code>git show-ref</code>.</em></p>
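<p>The ref name in the <code>git show-ref</code> output appears to follow a simple layout: the SHA of the commit the experiment was run from (here the tip of <code>master</code>), split after its first two characters, followed by the experiment name. The sketch below rebuilds the name from those parts. The layout is inferred only from the output shown in this post and is an internal detail of DVC, so treat it as an assumption rather than a stable interface:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"># Values copied from the example output in this post.
baseline=f137703af59ba1b80e77505a762335805d05d212
name=exp-26220

# Split the baseline SHA into its first two characters and the remainder.
prefix=$(printf '%s' "$baseline" | cut -c1-2)
rest=$(printf '%s' "$baseline" | cut -c3-)

# Assumed layout: refs/exps/PREFIX/REST/NAME
ref="refs/exps/$prefix/$rest/$name"
echo "$ref"</code></pre></div>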
<p>If we examine the experiment commit itself, we can see that it is just a regular
commit object that contains our hyperparameter change and the results of the
run:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> show c42f481
</span>commit c42f48168830148b946f6a75d1bdbb25cda46f35 (refs/exps/f1/37703af59ba1b80e77505a762335805d05d212/exp-26220)
Author: Peter Rowlands <[email protected]>
Date: Mon Apr 19 04:24:04 2021 +0000
dvc: commit experiment 262206295221319fe5e8ca8a9854d6eb93ec0931fb377488910304cf5ed55f84
diff --git a/dvc.lock b/dvc.lock
index 0e92326..d81fe2b 100644
--- a/dvc.lock
+++ b/dvc.lock
@@ -30,19 +30,19 @@ stages:
size: 2455
params:
params.yaml:
- featurize.max_features: 3000
+ featurize.max_features: 2000
featurize.ngrams: 2
...
diff --git a/scores.json b/scores.json
index 27f6dab..8270914 100644
--- a/scores.json
+++ b/scores.json
@@ -1,4 +1,4 @@
{
- "avg_prec": 0.6040544652105823,
- "roc_auc": 0.9608017142900953
+ "avg_prec": 0.5858888885424922,
+ "roc_auc": 0.944996664954421
}
...</code></pre></div>
<h2 id="dvc-and-custom-git-refs" style="position:relative;">DVC and custom Git refs<a href="#dvc-and-custom-git-refs" aria-label="dvc and custom git refs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In DVC 2.0, we now use the custom <code>refs/exps</code> namespace for storing DVC
experiments in Git. Under the hood, using Git refs allows us to keep using all
of the versioning capabilities provided by Git, without polluting your
repository with actual Git branches and tags. Since the user-friendly Git
porcelain commands (like <code>git checkout</code> and <code>git diff</code>) only resolve branches
and tags (and will ignore custom references), DVC experiments are essentially
hidden from your Git repository (and only visible to DVC commands).</p>
<p>Even though the experiment refs themselves are “invisible” to Git porcelain
commands, Git commit SHAs for experiments can be used in any Git command. This
allows you to leverage the power of tools like <code>git diff</code> to compare things like
code changes between a DVC experiment and any other Git commit (meaning you can
even compare experiment commit SHAs to Git branches or tags).</p>
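<p>As a rough illustration, using the experiment commit SHA from earlier in this post (yours will differ), commands like these should compare the experiment against <code>master</code>:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"># Summarize which Git-tracked files differ between master and the experiment.
$ git diff --stat master c42f481
# Show the full diff for a single file, here the pipeline lock file.
$ git diff master c42f481 -- dvc.lock</code></pre></div>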
<p>Likewise, for tools which provide a GUI on top of Git, experiments will be
hidden from your repository in typical use cases:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/463a54c559f3d4d4780e6a20d0acad93/39600/gitk-branches-tags.png" alt="gitk --branches --tags example" title="gitk --branches --tags" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em><code>gitk --branches --tags</code></em></p>
<p>Tools which provide the capability to display all Git refs (including custom
namespaces) can also be used to view experiments as if they were Git branches:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/0fae615939a6b1d30cce0339e5301a7f/39600/gitk-all.png" alt="gitk --all example screenshot" title="gitk --all" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em><code>gitk --all</code></em></p>
<p>Experiments are also completely local (since custom refs are not transferred to
or from Git remotes on <code>git push</code> and <code>git pull</code>), meaning that even if you run
thousands of experiments locally, you do not need to worry about accidentally
polluting your team’s upstream GitHub or GitLab repository with those
experiments. However, individual DVC experiments can be explicitly shared via
remote Git repositories using the <a href="https://dvc.org/doc/command-reference/exp/push"><code>dvc exp push</code></a> and <a href="https://dvc.org/doc/command-reference/exp/pull"><code>dvc exp pull</code></a> commands.
Regular Git branches can also be created from experiments via
<a href="https://dvc.org/doc/command-reference/exp/branch"><code>dvc exp branch</code></a>.</p>
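<p>As a sketch of how this might look in practice, assuming a Git remote named <code>origin</code> and the experiment from this post (the branch name <code>max-features-2000</code> is made up for illustration):</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"># Share a single experiment through the Git remote.
$ dvc exp push origin exp-26220
# On another clone of the repository, fetch that experiment.
$ dvc exp pull origin exp-26220
# Promote the experiment to a regular Git branch.
$ dvc exp branch exp-26220 max-features-2000</code></pre></div>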
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Prior to version 2.0, DVC already provided a method for versioning (and
reproducing) ML pipelines with Git. By extending DVC's existing capabilities
with the functionality offered by custom Git references, we've created a new
framework for users to easily generate and track their experiments. And when
used in conjunction with the other new features provided in 2.0 (like
<a href="https://dvc.org/doc/command-reference/exp/run#checkpoints" target="_blank" rel="nofollow noopener noreferrer">checkpoints versioning</a>
and
<a href="https://dvc.org/doc/user-guide/project-structure/pipelines-files#templating" target="_blank" rel="nofollow noopener noreferrer">pipeline parametrization</a>),
DVC can now fulfill certain use cases which were unfeasible with typical pre-2.0
DVC + Git workflows, including hyperparameter tuning and deep learning
scenarios.</p>
<p>We hope that whether you are new to DVC or a long-time user, you will try out
the new capabilities provided in our 2.0 release. And as always, if you have any
questions, comments or suggestions, please feel free to connect with the DVC
community on <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">Discourse</a>,
<a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord</a> and <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">GitHub</a>.</p>https://dvc.org/blog/april-21-dvc-heartbeathttps://dvc.org/blog/april-21-dvc-heartbeatFri, 16 Apr 2021 00:00:00 GMT<h2 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We're starting with the community this month because it has been overflowing
with great content from our users. It's like we're on a sugar high!</p>
<p><img src="https://media.giphy.com/media/oiGCnybFPh6Q8/giphy.gif" alt="Sugar High"></p>
<h3 id="goku-mohandas-new-lessons" style="position:relative;">Goku Mohandas' New Lessons!<a href="#goku-mohandas-new-lessons" aria-label="goku mohandas new lessons permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>First up, <a href="https://twitter.com/GokuMohandas" target="_blank" rel="nofollow noopener noreferrer">Goku Mohandas</a> of
<a href="https://madewithml.com/" target="_blank" rel="nofollow noopener noreferrer">Made With ML</a> has added this
<a href="https://madewithml.com/courses/mlops/versioning/" target="_blank" rel="nofollow noopener noreferrer">Versioning Lesson</a> to the
popular <strong>MLOps Course</strong> using DVC.<br>
It's RT'ing around the MLOps Twitter space like hotcakes! 🥞</p>
<p>
</p><section class="elp-content-holder">
<a href="https://madewithml.com/courses/mlops/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">MLOps - Versioning Code, Data and Models</h4>
<div class="elp-description">Using DVC to version data and models for reproducibility
in a local storage use case</div>
<div class="elp-link">https://madewithml.com/</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-04-16/made-with-ml-logo-aecae356305b60ef1f8a39aa3a167d05.png" alt="MLOps - Versioning Code, Data and Models">
</div>
</a>
</section>
<p></p>
<h3 id="ryzal-kamis-tutorial" style="position:relative;">Ryzal Kamis Tutorial<a href="#ryzal-kamis-tutorial" aria-label="ryzal kamis tutorial permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://www.linkedin.com/in/ryzalkamis/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ryzal Kamis</strong></a> of
<a href="https://twitter.com/AISingapore" target="_blank" rel="nofollow noopener noreferrer">AI Singapore</a> has created an
<a href="https://makerspace.aisingapore.org/2021/04/data-versioning-for-cd4ml-part-2/" target="_blank" rel="nofollow noopener noreferrer"><strong>in depth tutorial</strong></a>
on data versioning using DVC. This is a follow-up article to his
<a href="https://dvc.org/blog/september-20-dvc-heartbeat" target="_blank" rel="nofollow noopener noreferrer">tutorial that was featured in the September Heartbeat.</a>
Thanks Ryzal for this detailed work! 🙏🏼</p>
<p>
</p><section class="elp-content-holder">
<a href="https://makerspace.aisingapore.org/2021/04/data-versioning-for-cd4ml-part-2/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Data Versioning for CD4ML - Part 2</h4>
<div class="elp-description">Complete tutorial for beginning continuous integration, automated
testing and versioning, experiment tracking, reproducing the model training
pipeline and creating a Flask app for predictive use of the model </div>
<div class="elp-link">https://makerspace.aisingapore.org/</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-04-16/ai-singapore-logo-edbd4b64f8041fff792efadac70c5f57.jpeg" alt="Data Versioning for CD4ML - Part 2">
</div>
</a>
</section>
<p></p>
<h2 id="dvc-used-to-help-in-research-published-in-the-international-journal-of-molecular-sciences-" style="position:relative;">DVC used to help in Research published in the International Journal of Molecular Sciences 🧑🏻🔬<a href="#dvc-used-to-help-in-research-published-in-the-international-journal-of-molecular-sciences-" aria-label="dvc used to help in research published in the international journal of molecular sciences permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://www.linkedin.com/in/antonkulaga/" target="_blank" rel="nofollow noopener noreferrer">Anton Kulaga</a> and his team used DVC
pipeline tracking in their research that selects genes connected with maximum
lifespan in mammals. You can check out the
<a href="https://www.mdpi.com/1422-0067/22/3/1073" target="_blank" rel="nofollow noopener noreferrer">paper here</a> as well as their
<a href="https://docs.google.com/document/d/1kI1f62z0Opt8KD4Mf1yrYKftYLOZel3EjbfjDJiQQzg/edit" target="_blank" rel="nofollow noopener noreferrer">pipeline use case here</a>
and their <a href="https://github.com/antonkulaga/yspecies" target="_blank" rel="nofollow noopener noreferrer">GitHub repository.</a></p>
<p>See the diagram of the research below.👇🏼</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/be14fd9336d2db0a3e6ae40ba77b965f/39600/longevity-study.png" alt="longevity study" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>longevity research
diagram</em></p>
<h2 id="dagshub-️-dvc-colab-notebook" style="position:relative;">DAGsHub ❤️ DVC Colab Notebook<a href="#dagshub-%EF%B8%8F-dvc-colab-notebook" aria-label="dagshub ️ dvc colab notebook permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The DevRel team at <a href="https://dagshub.com/" target="_blank" rel="nofollow noopener noreferrer">DAGsHub</a> made
<a href="https://colab.research.google.com/drive/1JJIwAH0TBSY49um5s2FD0GEA6bw3SKrd#scrollTo=cjbAYZDfB3JB" target="_blank" rel="nofollow noopener noreferrer">this cool notebook</a>
that trains a model to classify email as either 'Ham' or 'Spam.' The notebook
shows how to integrate DAGsHub remote storage with DVC to track code and data
files.</p>
<p><img src="https://media.giphy.com/media/7pLv68ItwBaHS/giphy.gif" alt="Robin Williams Thats The Good Stuff GIF"></p>
<h2 id="en-español" style="position:relative;">En Español<a href="#en-espa%C3%B1ol" aria-label="en español permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Yurely Camacho of <a href="http://opensciencelabs.org/" target="_blank" rel="nofollow noopener noreferrer">Open Science Labs</a> created this
blog post for our Spanish-speaking friends on DVC and the advantages of using
it! ¡Olé!💃🏻</p>
<p>
</p><section class="elp-content-holder">
<a href="http://opensciencelabs.org/2021/03/22/que-es-el-data-version-control-y-por-que-es-necesario-que-tu-equipo-sepa-como-utilizarlo/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Qué es el Data Version Control y por qué es necesario que tu equipo sepa cómo utilizarlo</h4>
<div class="elp-description">Advantages to using DVC for data version control and team collaboration</div>
<div class="elp-link">http://opensciencelabs.org/</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-04-16/open-science-labs-logo-9d053dacd1ee0a6a63a146718d12b20d.png" alt="Qué es el Data Version Control y por qué es necesario que tu equipo sepa cómo utilizarlo">
</div>
</a>
</section>
<p></p>
<h2 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Pick a card, any card… You have not 1, but 3 interviews and talks to choose
from this Heartbeat:</p>
<ul>
<li><a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov's</strong></a>
<a href="https://opencv.org/opencv-ai-for-entrepreneurs-unveils-new-podcast-episode/" target="_blank" rel="nofollow noopener noreferrer">interview</a>
with
<a href="https://www.linkedin.com/in/anna-petrovicheva-44b24673/" target="_blank" rel="nofollow noopener noreferrer">Anna Petrovicheva</a>
on <a href="https://twitter.com/opencvlibrary" target="_blank" rel="nofollow noopener noreferrer">OpenCV</a></li>
<li>Dmitry's <a href="https://www.youtube.com/watch?v=g3i-9Gk8BiA" target="_blank" rel="nofollow noopener noreferrer">interview</a> with
<a href="https://twitter.com/dswharshit" target="_blank" rel="nofollow noopener noreferrer">Harshit Tyagi</a> of
<a href="https://www.youtube.com/channel/UCH-xwLTKQaABNs2QmGxK2bQ" target="_blank" rel="nofollow noopener noreferrer">Data Science with Harshit</a>,
and</li>
<li>Dmitry's <a href="https://www.youtube.com/watch?v=J8mCr3wVgdA" target="_blank" rel="nofollow noopener noreferrer">talk</a> at the
<a href="https://twitter.com/TMLS_TO" target="_blank" rel="nofollow noopener noreferrer">Toronto Machine Learning Society</a></li>
</ul>
<p>Spoiler alert ⚠️: You can't choose wrong!</p>
<p><img src="https://media.giphy.com/media/GXrcAztzRX9kI/giphy.gif" alt="Cards GIF"></p>
<h2 id="and-we-keep-on-growing-our-worldwide-team-" style="position:relative;">And we keep on growing our worldwide team! 🌏<a href="#and-we-keep-on-growing-our-worldwide-team-" aria-label="and we keep on growing our worldwide team permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We are getting to the point where our new hires could take up our whole
Heartbeat! 😅🚀💗</p>
<p><a href="https://www.linkedin.com/in/julianna-galvan/" target="_blank" rel="nofollow noopener noreferrer"><strong>Julie Galvan</strong></a> joins our team
from Houston, Texas as an engineer. She is focused on web development. In her
free time Julie loves reading, especially fantasy fiction (Harry Potter #6 was
fav) and paper crafting. Welcome Julie!🎉</p>
<p><a href="https://www.linkedin.com/in/matt-seddon/" target="_blank" rel="nofollow noopener noreferrer"><strong>Matt Seddon</strong></a> joins us from Down
Under as a DVC front-end engineer! 🦘 He lives in Kiama, a small town on the
East Coast of Australia. Originally from Scotland, when he's not programming he
likes to spend time with his family away from screens (😅🙌🏼) and he volunteers
for the state emergency service. 🤲🏼</p>
<p><a href="https://www.linkedin.com/in/gaoyanxiang/" target="_blank" rel="nofollow noopener noreferrer"><strong>Yanxiang Gao</strong></a> (who graciously
allows us to call him Gao) joins us from Hangzhou, China as a new DVC engineer.
Gao has a Master's in Physics and has previously worked as a Machine Learning
engineer in Chinese tech companies using DVC. He has been a long-time
contributor to DVC and we are so glad to have him on the team now!🎉</p>
<p><a href="https://www.linkedin.com/in/danielkharitonov/" target="_blank" rel="nofollow noopener noreferrer"><strong>Daniel Kharitonov</strong></a> joins us
from Stanford, California as a Technical Product Manager Intern. Daniel
graduated from Stanford with Masters CS / AI and PhD MS&E degrees. His previous
industry roles involved working on core routing products at juniper.net, medical
image augmentation with GANs, and synth data generation for autonomous vehicles.
Welcome to the team Daniel! 🙌🏼</p>
<p>Last but not least, and just this week,
<a href="https://www.linkedin.com/in/milecia/" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a> joins us as a
Developer Advocate from Tulsa, Oklahoma. Milecia has a background in mechanical
and aerospace engineering, some machine learning on autonomous vehicles, and
basically everything that the web touches. She also practices kung fu in her
free time.🥋🙇🏻♀️ We think that's "Oklahoma, OK!" 👌🏼</p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Even with all our new hires, we're still building!</p>
<p><a href="https://weworkremotely.com/company/iterative" target="_blank" rel="nofollow noopener noreferrer"><strong>Check out our three open roles</strong></a>
for:</p>
<ul>
<li><a href="https://weworkremotely.com/remote-jobs/iterative-senior-frontend-engineer" target="_blank" rel="nofollow noopener noreferrer"><strong>Senior Frontend Engineer</strong></a></li>
<li><a href="https://weworkremotely.com/remote-jobs/iterative-senior-software-engineer-open-source-dev-tools-3" target="_blank" rel="nofollow noopener noreferrer"><strong>Senior Software Engineer - Open Source, Dev Tools</strong></a>
and</li>
<li><a href="https://weworkremotely.com/remote-jobs/iterative-developer-advocate" target="_blank" rel="nofollow noopener noreferrer"><strong>Developer Advocate</strong>.</a></li>
</ul>
<p>Does this sound like you or someone you know? Be in touch!</p>
<h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Don't miss our
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/277245660" target="_blank" rel="nofollow noopener noreferrer">Meetup</a>
April 28th at 3:00pm UTC, where we will be demo-ing Pipelines and CML! Bring
your questions! We're here to help!</p>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">DVC is an amazing tool. Great milestone. <br><br>It already removes a lot of headaches in my <a href="https://twitter.com/hashtag/MachineLearning?src=hash&ref_src=twsrc%5Etfw">#MachineLearning</a> work. <br><br>But with new features, I will be even more productive :) <a href="https://t.co/pMyVXS292j">https://t.co/pMyVXS292j</a></p>— Vladimir Iglovikov (@viglovikov) <a href="https://twitter.com/viglovikov/status/1367193818152411137">March 3, 2021</a></blockquote>
<p>We love removing your headaches! 🙌🏼 You're all caught up! See you at the next
Community Gems 💎!</p>
<hr>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/march-21-community-gemshttps://dvc.org/blog/march-21-community-gemsWed, 31 Mar 2021 00:00:00 GMT<h3 id="q-will-dvc-work-with-my-remote-cloud-storage-of-choice" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/821493606770409493" target="_blank" rel="nofollow noopener noreferrer">Q: Will DVC work with <my remote cloud storage of choice?></a><a href="#q-will-dvc-work-with-my-remote-cloud-storage-of-choice" aria-label="q will dvc work with my remote cloud storage of choice permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We recently had questions about this, specifically regarding Huawei Cloud and
Backblaze B2 Storage. The answer is that any cloud storage with an
S3-compatible interface will work with DVC, and both of the aforementioned have
one! In addition, DVC works with Azure, Google Drive, GS, OSS, and SSH.
<a href="https://dvc.org/doc/command-reference/remote" target="_blank" rel="nofollow noopener noreferrer">Learn more about S3 compatibility integrations and all available remote storage capabilities here.</a></p>
<p>Thanks to @luke and @Samuel H from Discord for asking these questions that led
to this Gem! 💎</p>
<h3 id="q-i-had-understood-previously-that-dvc-was-not-suitable-for-hyperparameter-tuning-has-that-changed" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/820722752709328967" target="_blank" rel="nofollow noopener noreferrer">Q: I had understood previously that DVC was not suitable for hyperparameter tuning. Has that changed?</a><a href="#q-i-had-understood-previously-that-dvc-was-not-suitable-for-hyperparameter-tuning-has-that-changed" aria-label="q i had understood previously that dvc was not suitable for hyperparameter tuning has that changed permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes indeed! With DVC 2.0, the capabilities have evolved quite a bit! We have
introduced experiments and metrics, which enable you to track and compare the
different runs of your models with various hyperparameters. You can check out
the documents <a href="https://dvc.org/doc/start/experiments" target="_blank" rel="nofollow noopener noreferrer">here</a> and
<a href="https://dvc.org/doc/start/metrics-parameters-plots" target="_blank" rel="nofollow noopener noreferrer">here</a> to see all the
details.</p>
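<p>As a rough illustration of how that can look for a hyperparameter sweep, the
Python sketch below builds one <code>dvc exp run --queue --set-param ...</code>
command per combination in a small grid. The <code>grid_commands</code> helper
and the parameter names (<code>train.lr</code>, <code>train.epochs</code>) are
hypothetical; substitute the keys from your own <code>params.yaml</code>.</p>

```python
import itertools
import shlex


def grid_commands(grid):
    """Build one `dvc exp run --queue` command per hyperparameter combination."""
    names = sorted(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        cmd = ["dvc", "exp", "run", "--queue"]
        for name, value in zip(names, values):
            # --set-param sets a params.yaml value for this experiment.
            cmd += ["--set-param", f"{name}={value}"]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Hypothetical parameter names; use the keys from your own params.yaml.
    grid = {"train.lr": [0.01, 0.001], "train.epochs": [10, 20]}
    for cmd in grid_commands(grid):
        # Only print the commands; nothing is executed here.
        print(shlex.join(cmd))
```

<p>The sketch only prints the commands. Once they have been queued,
<code>dvc exp run --run-all</code> executes them and <code>dvc exp show</code>
lets you compare the results.</p>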
<p>Thanks to @saif3r for helping us highlight the new features in DVC!</p>
<h3 id="q-is-it-possible-to-set-up-a-dvc-repo-with-pipelines-which-have-all-the-data-cache-input-output-on-another-local-location-outside-the-repo" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/819509440217874473" target="_blank" rel="nofollow noopener noreferrer">Q: Is it possible to set up a DVC repo with pipelines which have all the data (cache, input, output) on another (local) location outside the repo?</a><a href="#q-is-it-possible-to-set-up-a-dvc-repo-with-pipelines-which-have-all-the-data-cache-input-output-on-another-local-location-outside-the-repo" aria-label="q is it possible to set up a dvc repo with pipelines which have all the data cache input output on another local location outside the repo permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Thanks for the question @EEisbrenner!</p>
<p>One solution to this would be to keep your DVC cache on your mount, and use the
<code>symlink</code> cache type so all of your data would remain on that mount, but for
DVC's purposes it would only deal with files that are "inside" your repo (via
symlinks). Note that your data on that mount would be stored in DVC's
content-addressable cache format, and not in <code>path/to/mount/foo.nc</code>. Check out
the docs on
<a href="https://dvc.org/doc/use-cases/fast-data-caching-hub#example-shared-development-server" target="_blank" rel="nofollow noopener noreferrer">how to keep DVC cache on your mount here.</a></p>
<p>To actually work with <code>foo.nc</code>, you'd end up with a symlink <code>foo.nc</code> inside your
git/DVC repo that points to some object in your DVC cache.<br>
<a href="https://dvc.org/doc/user-guide/large-dataset-optimization" target="_blank" rel="nofollow noopener noreferrer">See these docs</a> for
info on how the cache link types work. For doing the initial <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> step for
your data without needing to copy it into the DVC/repo first,
<a href="https://dvc.org/doc/command-reference/add#example-transfer-to-the-cache" target="_blank" rel="nofollow noopener noreferrer">check out these docs</a>.</p>
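<p>To make "content-addressable cache plus symlinks" a little more concrete,
here is a minimal Python sketch of the idea. It is not DVC's implementation:
the <code>add_to_cache</code> helper, the MD5-based layout and the paths are
assumptions chosen for illustration. The file is moved into a cache directory
under a name derived from its contents, and a symlink is left in its original
place.</p>

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def add_to_cache(path: Path, cache_dir: Path) -> Path:
    """Move `path` into a content-addressed cache and leave a symlink behind."""
    # Reading the whole file is fine for a sketch; real tools hash in chunks.
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    # The object is named after its content, not after its original path.
    target = cache_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists():
        path.unlink()  # identical content is already cached
    else:
        shutil.move(str(path), str(target))
    path.symlink_to(target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical layout: a repo plus a cache on a separate "mount".
        repo, cache = Path(tmp, "repo"), Path(tmp, "mount", "cache")
        repo.mkdir()
        data = repo / "foo.nc"
        data.write_bytes(b"example data")
        cached = add_to_cache(data, cache)
        print("symlink in repo:", data.is_symlink())
        print("object in cache:", cached.relative_to(cache))
```

<p>The file in the repo is now just a link, while the bytes live on the mount
under a hash-derived name, which is why you will not find
<code>path/to/mount/foo.nc</code> there.</p>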
<h3 id="q-my-peers-and-i-share-a-repo-where-we-have-a-folder-that-is-versioned-with-dvc-im-getting-an-error-message-when-trying-to-pull-data-from-the-cloud-what-could-be-causing-it" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/799617584336338954" target="_blank" rel="nofollow noopener noreferrer">Q: My peers and I share a repo where we have a folder that is versioned with DVC. I'm getting an error message when trying to pull data from the cloud. What could be causing it?</a><a href="#q-my-peers-and-i-share-a-repo-where-we-have-a-folder-that-is-versioned-with-dvc-im-getting-an-error-message-when-trying-to-pull-data-from-the-cloud-what-could-be-causing-it" aria-label="q my peers and i share a repo where we have a folder that is versioned with dvc im getting an error message when trying to pull data from the cloud what could be causing it permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>I see you are having the following error:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span>
</span>
Everything is up to date.
ERROR: failed to pull data from the cloud - 'data\rhinoceros.dvc' format error: extra keys not allowed @ data['outs'][0]['size']
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc doctor</span>
</span>
DVC version: 1.9.1 (exe)
---------------------------------
Platform: Python 3.7.9 on Windows-10-10.0.19041-SP0
Supports: All remotes
Cache types: hardlink
Cache directory: NTFS on C:\
Workspace directory: NTFS on C:\
Repo: dvc, git</code></pre></div>
<p>Your colleague is likely running a newer version of DVC. Upgrade so that all
are on the same version and you will be good to go!</p>
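<p>Why does an extra field break an older version? The sketch below shows the
general mechanism with a strict validator that rejects any key it does not know
about. The key sets and the <code>validate_outs</code> helper are hypothetical
and only loosely modelled on the error message above; they are not DVC's real
schema.</p>

```python
# Hypothetical schemas: the keys an "older" and a "newer" release accept for
# each entry under `outs`. These sets are illustrative, not DVC's real schema.
OLD_OUT_KEYS = {"md5", "path"}
NEW_OUT_KEYS = OLD_OUT_KEYS | {"size"}


def validate_outs(data, allowed_keys):
    """Raise ValueError on the first key in data['outs'] that is not allowed."""
    for index, out in enumerate(data.get("outs", [])):
        for key in out:
            if key not in allowed_keys:
                raise ValueError(
                    f"extra keys not allowed @ data['outs'][{index}]['{key}']"
                )


if __name__ == "__main__":
    # A file as a newer release might write it (made-up values).
    dvc_file = {"outs": [{"md5": "0123abcd", "path": "rhinoceros", "size": 1024}]}
    validate_outs(dvc_file, NEW_OUT_KEYS)  # the newer schema accepts it
    try:
        validate_outs(dvc_file, OLD_OUT_KEYS)  # the older schema does not
    except ValueError as err:
        print(f"format error: {err}")
```

<p>The file itself is fine; the older reader simply has no entry for the new
key, so upgrading the older installation resolves the mismatch.</p>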
<p>Thanks @ojon for this important gem! 💎</p>
<h3 id="q-how-do-i-create-multiple-pipeline-dvcyaml-files-for-different-experiments" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/824846339288334356" target="_blank" rel="nofollow noopener noreferrer">Q: How do I create multiple pipeline (<code>dvc.yaml</code>) files for different experiments?</a><a href="#q-how-do-i-create-multiple-pipeline-dvcyaml-files-for-different-experiments" aria-label="q how do i create multiple pipeline dvcyaml files for different experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You could create separate directories for each experiment and keep your
pipelines organized with separate <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> files. You can find more
information on
<a href="https://dvc.org/doc/user-guide/experiment-management#organization-patterns" target="_blank" rel="nofollow noopener noreferrer">organization patterns for experiments here.</a>
Currently we are working on a way to compare metrics between different paths if
using this method of keeping experiments in different directories.
<a href="https://github.com/iterative/dvc/issues/5074" target="_blank" rel="nofollow noopener noreferrer">You can follow that issue here!</a></p>
<p>Thanks @tijoseymathew for your question in Discord!</p>
<h3 id="q-is-there-a-way-to-run-git-checkout-and-dvc-checkout-in-one-command" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/818488624303046677" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to run "git checkout" and "dvc checkout" in one command?</a><a href="#q-is-there-a-way-to-run-git-checkout-and-dvc-checkout-in-one-command" aria-label="q is there a way to run git checkout and dvc checkout in one command permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yep! There's a way! We offer a Git hook for <code>post-checkout</code>, which automates DVC
checkout right after <code>git checkout</code>. You can use <a href="https://dvc.org/doc/command-reference/install"><code>dvc install</code></a> to install that
hook.<br>
<a href="https://dvc.org/doc/command-reference/install" target="_blank" rel="nofollow noopener noreferrer">Check out these docs</a> for all
the info on installing Git hooks
<a href="https://dvc.org/doc/command-reference/install#example-checkout-both-git-and-dvc" target="_blank" rel="nofollow noopener noreferrer">and here</a>
for a specific example!</p>
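<p>If you would like to see what the hook saves you from typing, the two steps
can also be spelled out in a small script. This is only a sketch: the
<code>checkout</code> and <code>checkout_commands</code> helpers are
hypothetical names, and it assumes both <code>git</code> and <code>dvc</code>
are on your <code>PATH</code>.</p>

```python
import subprocess


def checkout_commands(rev):
    """Return the two commands that switch code and then data to `rev`."""
    return [["git", "checkout", rev], ["dvc", "checkout"]]


def checkout(rev, run=subprocess.run):
    """Run `git checkout <rev>` followed by `dvc checkout`, stopping on failure."""
    for cmd in checkout_commands(rev):
        # check=True raises CalledProcessError if a command exits non-zero,
        # so `dvc checkout` is skipped when `git checkout` fails.
        run(cmd, check=True)


if __name__ == "__main__":
    # Show the plan for a hypothetical branch name without executing anything.
    for cmd in checkout_commands("my-branch"):
        print(" ".join(cmd))
```

<p>Calling <code>checkout("my-branch")</code> inside a DVC project would run
both commands in order. With the hook installed, a plain
<code>git checkout</code> is all you need.</p>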
<p>Many thanks to @Thyrix for this question!</p>
<h3 id="q-how-do-i-set-a-remote-in-google-drive-and-share-with-someone-else" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/819432969260761131" target="_blank" rel="nofollow noopener noreferrer">Q: How do I set a remote in Google Drive and share with someone else?</a><a href="#q-how-do-i-set-a-remote-in-google-drive-and-share-with-someone-else" aria-label="q how do i set a remote in google drive and share with someone else permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://dvc.org/doc/user-guide/setup-google-drive-remote" target="_blank" rel="nofollow noopener noreferrer">These docs</a> will show
you how to get a remote Google Drive set up! Be sure to set up the remote
folder's permissions! For more information on sharing permissions in Google
Drive
<a href="https://support.google.com/drive/answer/7166529?co=GENIE.Platform%3DDesktop&hl=en" target="_blank" rel="nofollow noopener noreferrer">see these docs.</a></p>
<p>Thanks @Carlos Lopez H for this important gem! 💎</p>
<p><img src="https://media.giphy.com/media/l0IycQmt79g9XzOWQ/giphy.gif" alt="Shut It Down GIF by Matt Cutshall"></p>
<p>At our April Office Hours Meetup we will be demo-ing pipelines as well as CML.
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/277245660/?isFirstPublish=true" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a>
to stay up to date with specifics as we get closer to the event!</p>
<p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and
CML questions answered!</p>https://dvc.org/blog/March-21-dvc-heartbeathttps://dvc.org/blog/March-21-dvc-heartbeatMon, 15 Mar 2021 00:00:00 GMT<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Welcome to March! It's been a great month already! Here's all that will keep you
in the know.</p>
<p><img src="https://media.giphy.com/media/J2gg8fO7RarRgQRC4d/giphy.gif" alt="UnderRock"></p>
<h2 id="icymi---dvc-20-is-here" style="position:relative;">ICYMI - DVC 2.0 is here!<a href="#icymi---dvc-20-is-here" aria-label="icymi dvc 20 is here permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>If you somehow missed our
<a href="https://dvc.org/blog/dvc-2-0-release" target="_blank" rel="nofollow noopener noreferrer">March 3rd announcement</a>, DVC 2.0 is here
with loads of features to make your life easier.</p>
<p>🧪 Lightweight ML experiments</p>
<p>📍 ML model checkpoints versioning</p>
<p>📈 Dvc-live - new open-source library for metrics logging</p>
<p>🔗 ML pipeline templating and iterative foreach-stages</p>
<p>🤖 CML - new way to get GPU/CPU in clouds and GitHub Actions</p>
<p>This video from the team gives you an overview of all the new features.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/h-ioXYurEJo?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="and-we-keep-on-growing-our-worldwide-team-" style="position:relative;">And we keep on growing our worldwide team! 🌏<a href="#and-we-keep-on-growing-our-worldwide-team-" aria-label="and we keep on growing our worldwide team permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We have three new team members this month!</p>
<p><a href="https://www.linkedin.com/in/duijf/" target="_blank" rel="nofollow noopener noreferrer"><strong>Laurens Duijvesteijn</strong></a> joins the team
from Utrecht, The Netherlands as a backend infrastructure engineer. Previously
he led a devops team at Channable where he learned that he really enjoys working
on developer tools and empowering people to do great work. When not solving dev
challenges, he enjoys bouldering/climbing, snowboarding and hiking! Welcome
Laurens!</p>
<p><a href="https://github.com/0x2b3bfa0" target="_blank" rel="nofollow noopener noreferrer"><strong>Helio Machado</strong></a> joins our team from Spain as a
CML engineer! Helio comes from a heutagogic background, mainly focused on
Free and Open Source culture and technologies from a systems perspective. You
will find his clever cryptographic handle helping you out in Discord with your CML
questions. Fun fact: Our two CML engineers, Helio and David Ortega, live just 300
km apart in Spain! CML has some Spanish flair! 💃🏻🇪🇸</p>
<p><a href="https://www.linkedin.com/in/mikhail-rozhkov-33549118/" target="_blank" rel="nofollow noopener noreferrer"><strong>Mikhail Rozhkov</strong></a>
joins us from Moscow, Russia as a Solution Engineer. Mikhail has been working
with DVC for 2+ years in the banking industry. He is also the creator of the
<a href="https://mlrepa.com" target="_blank" rel="nofollow noopener noreferrer"><strong>Machine Learning REPA</strong></a> community and of our
<a href="https://www.udemy.com/course/machine-learning-experiments-and-engineering-with-dvc/" target="_blank" rel="nofollow noopener noreferrer"><strong>first course on Udemy</strong></a>.
We are so excited to have him officially join our team full-time!</p>
<p><img src="https://media.giphy.com/media/3ohhwznAY9PN08m0H6/giphy.gif" alt="Join Us"></p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Come join our team! Open positions this month:</p>
<p><a href="https://docs.google.com/document/d/1aT5HZYt4kAUxXqD4JNTe3jPDlVUwSmnEWDPR2QoKdvo/edit" target="_blank" rel="nofollow noopener noreferrer">TypeScript Front-End Engineer</a>
to build SaaS and a VS Code UI for our popular machine learning tools: DVC and
CML. The ML tools ecosystem is what the JS space was 10 years ago. Come join us on
this exciting project!</p>
<p>Our search continues for a
<a href="https://weworkremotely.com/remote-jobs/iterative-developer-advocate" target="_blank" rel="nofollow noopener noreferrer">Developer Advocate</a>
to support and inspire developers by creating new content like blogs, tutorials,
and videos - plus lead outreach through meetups and conferences.</p>
<p>Does this sound like you or someone you know? Be in touch!</p>
<h2 id="dmitry-featured-on-tfir-insights" style="position:relative;">Dmitry featured on TFIR Insights<a href="#dmitry-featured-on-tfir-insights" aria-label="dmitry featured on tfir insights permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://twitter.com/tfir_io" target="_blank" rel="nofollow noopener noreferrer"><strong>Swapnil Bhartiya</strong></a> of
<a href="https://www.tfir.io/" target="_blank" rel="nofollow noopener noreferrer">TFIR Insights</a> interviewed our very own CEO,
<a href="https://twitter.com/fullstackml" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a>, on his show discussing:</p>
<ul>
<li>Iterative.ai</li>
<li>Why Open Source is a better approach for AI/ML</li>
<li>DVC and CML</li>
<li>Who should care about these tools</li>
<li>How DVC and CML stack up against proprietary AI Platforms such as AWS
SageMaker and Microsoft Azure ML Engineer</li>
</ul>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/lv2cpm9Pduk?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="elle-at-datatalksclub-conference" style="position:relative;">Elle at DataTalks.Club Conference<a href="#elle-at-datatalksclub-conference" aria-label="elle at datatalksclub conference permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://twitter.com/drelleobrien" target="_blank" rel="nofollow noopener noreferrer"><strong>Elle O'Brien</strong></a> presents her talk
"Automating ML with Continuous Integration" at the
<a href="http://datatalks.club/" target="_blank" rel="nofollow noopener noreferrer">DataTalks.Club</a> Conference with
<a href="https://twitter.com/Al_Grigor" target="_blank" rel="nofollow noopener noreferrer"><strong>Alexey Grigorev</strong></a> and
<a href="https://www.linkedin.com/in/dpbrinkm/" target="_blank" rel="nofollow noopener noreferrer"><strong>Demetrios Brinkmann</strong></a> of
<a href="https://open.spotify.com/show/7wZygk3mUUqBaRbBGB1lgh" target="_blank" rel="nofollow noopener noreferrer">MLOps Community</a>. You can
catch her talk starting at 3:03 below. 👇🏼</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.youtube.com/watch?v=og1DG1KZ71c&t=11382s" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Automating ML with Continuous Integration</h4>
<div class="elp-description">Elle O'Brien, PhD presents at DataTalks.Club Conference</div>
<div class="elp-link">DataTalks.Club</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-03-15/confused-animals-3a01f72852765a7c4ced04e0819e8ba2.png" alt="Automating ML with Continuous Integration">
</div>
</a>
</section>
<p></p>
<h2 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="using-dvc-in-lab-data-management" style="position:relative;">Using DVC in Lab Data Management<a href="#using-dvc-in-lab-data-management" aria-label="using dvc in lab data management permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This great tutorial from <a href="https://mti-lab.github.io/blog/" target="_blank" rel="nofollow noopener noreferrer">Matsui-lab Blog</a>
provides a solution using DVC for the data management problem labs face.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://mti-lab.github.io/blog/yusuke%20matsui/education/labops/2021/03/03/dvc.html" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Versioning a Shared Dataset Using DVC and S3</h4>
<div class="elp-description">DVC solution in a lab environment</div>
<div class="elp-link">mti-lab.github.io</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-03-15/matsui-lab-blog-ed064db061f5e0f5ca1ce475fad16fe3.png" alt="Versioning a Shared Dataset Using DVC and S3">
</div>
</a>
</section>
<p></p>
<h3 id="healthcare-use-case-video-tutorial" style="position:relative;">Healthcare Use Case Video Tutorial<a href="#healthcare-use-case-video-tutorial" aria-label="healthcare use case video tutorial permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://www.linkedin.com/in/danial-senejohnny/" target="_blank" rel="nofollow noopener noreferrer"><strong>Danial Senejohnny</strong></a> created
this video outlining the use of DVC for healthcare institutes where the data
must be kept private and an on-premises data store is preferred. 👇🏼</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/K1iyWr4Z6go?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="scientific-journals-" style="position:relative;">Scientific Journals 🧑🏻🔬<a href="#scientific-journals-" aria-label="scientific journals permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We are excited to announce a scientific paper purely devoted to DVC coming out
from Queen's University. This publication by
<a href="https://www.linkedin.com/in/amine-barrak-0bb99160/" target="_blank" rel="nofollow noopener noreferrer"><strong>Amine Barrak</strong></a>,
<a href="https://www.linkedin.com/in/elliseghan/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ellis E Eghan</strong></a> and
<a href="https://www.linkedin.com/in/bramadams/" target="_blank" rel="nofollow noopener noreferrer"><strong>Bram Adams</strong></a>, will be presented at
the 28th IEEE International Conference on Software Analysis, Evolution, and
Reengineering. You can check it out here. 👇🏼</p>
<p>
</p><section class="elp-content-holder">
<a href="https://mcis.cs.queensu.ca/publications/2021/saner.pdf" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects</h4>
<div class="elp-description">Empirical Study of DVC Projects</div>
<div class="elp-link">mcis.cs.queensu.ca</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-03-15/EmpiricalStudyDVC-3e11b88175803e4a49d528d1f008126d.png" alt="On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects">
</div>
</a>
</section>
<p></p>
<p>This article by <strong>Samuel Idowu</strong>,
<a href="https://www.linkedin.com/in/daniel-g-str%C3%BCber-359134100/" target="_blank" rel="nofollow noopener noreferrer"><strong>Daniel Strüber</strong></a>,
and
<a href="https://www.linkedin.com/in/thorsten-berger-3a6a851ab/" target="_blank" rel="nofollow noopener noreferrer"><strong>Thorsten Berger</strong></a>,
reviews a number of asset management tools for machine learning, including DVC,
that address commonly reported ML engineering challenges.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://arxiv.org/pdf/2102.06919.pdf" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Asset Management in Machine Learning: A Survey</h4>
<div class="elp-description">Steps to use DVC in your data versioning</div>
<div class="elp-link">arxiv.org</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-03-15/arxiv-89cc24e1d73a143584fc0fb6a35d39a5.png" alt="Asset Management in Machine Learning: A Survey">
</div>
</a>
</section>
<p></p>
<p><img src="https://media.giphy.com/media/xT0xeJpnrWC4XWblEk/giphy.gif" alt="ScienceMindBlown"></p>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>From a Portuguese-speaking community member in Finland…</p>
<blockquote>
<p>"@DVCorg is surely among the best tools of the ecosystem of the last 3
years. It won't be long before DVC is as common as Scikit-Learn in ML / DS
projects with high maturity. 👏🏼👏🏼👏🏼"</p>
</blockquote>
<p>O <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">@DVCorg</a> seguramente está entre as melhores
ferramentas do ecossistema dos últimos 3 anos. Não vai demorar para o DVC ser
tão comum quanto o Scikit-Learn em projetos de ML/DS com alta maturidade. 👏👏👏
<a href="https://t.co/nnfecYoTQv" target="_blank" rel="nofollow noopener noreferrer">https://t.co/nnfecYoTQv</a></p>
<p>— Flávio Clésio March 3, 2021</p>
<p>We think so too! 🙌🏼 You're all caught up! See you at the next Community Gems 💎!</p>
<hr>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/dvc-2-0-releasehttps://dvc.org/blog/dvc-2-0-releaseWed, 03 Mar 2021 00:00:00 GMT<h2 id="tldr-video" style="position:relative;">TL;DR; video<a href="#tldr-video" aria-label="tldr video permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/h-ioXYurEJo?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="what-is-new-in-dvc-20" style="position:relative;">What is new in DVC 2.0?<a href="#what-is-new-in-dvc-20" aria-label="what is new in dvc 20 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We have been working on DVC for almost 4 years. In the previous versions, we
have built a great foundation on versioning data, code and ML models that helps
make your ML projects reproducible.</p>
<p>With the 2.0 release, we are going deeper into machine learning and deep
learning scenarios such as <strong>experiment management</strong>, <strong>ML model checkpoints</strong>
and <strong>ML metrics logging</strong>. These scenarios are widely adopted by ML
practitioners and instrumented with custom tools or external frameworks and SaaS
services. <strong>Our vision</strong> is to make the ML experimentation experience
distributed (like Git) and independent of external SaaS platforms, and to
introduce proper data and model management to ML experiments.</p>
<p>⚠️ DVC 2.0 is the first release with ML experiments, which are still in
experimentation mode (yeah, experiments in experimentation mode 😅), so the API
might change a bit in the following releases.</p>
<p><strong>ML pipelines parametrization</strong> is another big improvement in DVC 2.0. This was
the most requested feature during the last year. We are introducing variables in
pipelines as well as foreach-stages. This is a significant improvement for users
who work on multi-stage ML projects, which is very common for NLP projects.</p>
<p>A better <strong>CPU/GPU resource allocation</strong> is another important direction for DVC.
Together with DVC 2.0, we are releasing version 0.3 of CML (CI/CD for ML). It
aims to hide all complexity of clouds from data scientists and ML engineers. We
developed a brand new Iterative Terraform Provider to reach this goal and
simplify the end-user experience. In future releases, we expect DVC to use this
Terraform provider to access cloud resources directly.</p>
<p>Last but not least, we made the new release with <strong>minimal
breaking changes to our API</strong>. That makes migration to DVC 2.0 smooth and
low-risk.</p>
<h2 id="install" style="position:relative;">Install<a href="#install" aria-label="install permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The new version is generally available!</p>
<p>Install DVC 2.0 <a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">through OS packages</a> or as a Python
library:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> <span class="token parameter variable">--upgrade</span> dvc</span></code></pre></div>
<p>CML is pre-installed in the CML docker containers (e.g.
<code>iterativeai/cml:0-dvc2-base1</code>) and also available as an NPM package:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">npm</span> i <span class="token parameter variable">-g</span> @dvcorg/cml</span></code></pre></div>
<h2 id="lightweight-ml-experiments" style="position:relative;">Lightweight ML experiments<a href="#lightweight-ml-experiments" aria-label="lightweight ml experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>DVC uses Git versioning as the basis for ML experiments. This solid foundation
makes each experiment reproducible and accessible from the project's history.
This Git-based approach works very well for ML projects with mature models when
only a few new experiments per day are run.</p>
<p>However, in more active development, when dozens or hundreds of experiments need
to be run in a single day, Git creates overhead — each experiment run requires
additional Git commands <code>git add/commit</code>, and comparing all experiments is
difficult.</p>
<p>We are introducing lightweight experiments in DVC 2.0! This is how you can
auto-track ML experiments without any overhead.</p>
<p>⚠️ Note, our new ML experiment features (<a href="https://dvc.org/doc/command-reference/exp"><code>dvc exp</code></a>) are experimental. This means
that the commands might change a bit in the following minor releases.</p>
<p><a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> can run an ML experiment with a new hyperparameter from
<code>params.yaml</code> while <a href="https://dvc.org/doc/command-reference/exp/diff"><code>dvc exp diff</code></a> shows metrics and params difference:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.max_features</span><span class="token operator">=</span><span class="token number">3000</span>
</span>
Reproduced experiment(s): exp-bb55c
Experiment results have been applied to your workspace.
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp diff</span>
</span>Path Metric Value Change
scores.json auc 0.57462 0.0072197
Path Param Value Change
params.yaml featurize.max_features 3000 1500</code></pre></div>
<p>More experiments:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.max_features</span><span class="token operator">=</span><span class="token number">4000</span>
</span>Reproduced experiment(s): exp-9bf22
Experiment results have been applied to your workspace.
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.max_features</span><span class="token operator">=</span><span class="token number">5000</span>
</span>Reproduced experiment(s): exp-63ee0
Experiment results have been applied to your workspace.
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.max_features</span><span class="token operator">=</span><span class="token number">5000</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.ngrams</span><span class="token operator">=</span><span class="token number">3</span>
</span>Reproduced experiment(s): exp-80655
Experiment results have been applied to your workspace.</code></pre></div>
<p>In the examples above, hyperparameters were changed with the <code>--set-param</code>
option, but you can make these changes by modifying the params file instead. In
fact <em>any code can be changed</em> and <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> will capture the variations.</p>
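<p>To make the dotted parameter names more concrete, here is a minimal Python
sketch of how a key such as <code>featurize.max_features</code> could map onto a nested
params structure. It is an illustration only, not DVC's implementation: the
<code>set_param</code> helper is hypothetical, and the <code>baseline</code> dict is an assumed
in-memory stand-in for the contents of <code>params.yaml</code>.</p>

```python
import copy


def set_param(params, dotted_key, value):
    """Return a copy of params with the nested key named by dotted_key set to value."""
    updated = copy.deepcopy(params)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        # Creating missing sections on demand is an assumption of this sketch.
        node = node.setdefault(name, {})
    node[leaf] = value
    return updated


# Assumed stand-in for params.yaml, already loaded into a dict (hypothetical).
baseline = {"featurize": {"max_features": 1500, "ngrams": 2}}

experiment = set_param(baseline, "featurize.max_features", 3000)
print(experiment)
print(baseline)  # the baseline dict is left untouched
```

<p>Working on a copy keeps the baseline values available for comparison, which
mirrors the idea of comparing an experiment against its starting point.</p>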
<p>See all the runs:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-pager</span> <span class="token parameter variable">--no-timestamp</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--include-params</span> featurize.max_features,featurize.ngrams</span></code></pre></div>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ─────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>auc<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>featurize.max_features<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>featurize.ngrams<span class="token hide">**</span></span>
</span> ─────────────────────────────────────────────────────────────────────
<span class="token rows"> workspace 0.56359 5000 3
master 0.5674 1500 2
├── exp-80655 0.56359 5000 3
├── exp-63ee0 0.5515 5000 2
├── exp-9bf22 0.56448 4000 2
└── exp-bb55c 0.57462 3000 2
</span> ─────────────────────────────────────────────────────────────────────</code></pre></div>
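<p>Once the runs are listed, choosing a winner is a simple comparison. The Python
sketch below copies the values from the table above into a dict (the layout and
variable names are this sketch's own choice, not a DVC data format) and picks the
experiment with the highest <code>auc</code>, along with its change relative to
<code>master</code>.</p>

```python
# Values copied from the table above; the dict layout is an assumption of this sketch.
baseline_auc = 0.5674  # the master row
experiments = {
    "exp-80655": {"auc": 0.56359, "max_features": 5000, "ngrams": 3},
    "exp-63ee0": {"auc": 0.5515, "max_features": 5000, "ngrams": 2},
    "exp-9bf22": {"auc": 0.56448, "max_features": 4000, "ngrams": 2},
    "exp-bb55c": {"auc": 0.57462, "max_features": 3000, "ngrams": 2},
}

# Pick the experiment with the highest AUC.
best_name = max(experiments, key=lambda name: experiments[name]["auc"])
best = experiments[best_name]

# Change relative to the master baseline, rounded to five decimal places.
delta = round(best["auc"] - baseline_auc, 5)

print(best_name, best["auc"], delta)
```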
<p>Under the hood, DVC uses Git to store the experiments' meta-information. A
straightforward implementation would create visible branches and auto-commit in
them, but that approach would over-pollute the branch namespace very quickly. To
avoid this issue, we introduced custom Git references <code>exps</code>, the same way as
GitHub uses custom references <code>pulls</code> to track pull requests (this is an
interesting technical topic that deserves a separate blog post). Below you can
see how it works.</p>
<p>No artificial branches, only custom references <code>exps</code> (do not worry if you don't
understand this part - it is an implementation detail):</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> branch
</span>* master
<span class="token line"><span class="token input">$ </span><span class="token command">git</span> show-ref
</span>5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_APPLY
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_BRANCH
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/71/67904d89e116f28daf7a6e4c0878268117c893/exp-80655
f16e7b7c804cf52d91d1d11850c15963fb2a8d7b refs/exps/97/d69af70c6fb4bc59aefb9a87437dcd28b3bde4/exp-63ee0
0566d42cddb3a8c4eb533f31027f0febccbbc2dd refs/exps/91/94265d5acd847e1c439dd859aa74b1fc3d73ad/exp-bb55c
9bb067559583990a8c5d499d7435c35a7c9417b7 refs/exps/49/5c835cd36772123e82e812d96eabcce320f7ec/exp-9bf22</code></pre></div>
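<p>For readers who want to poke at these references, here is a small Python sketch
that extracts experiment names from <code>git show-ref</code> output. The sample text
is copied from the listing above. The <code>experiment_refs</code> helper is
hypothetical, and the five-part ref layout it expects is inferred only from that
listing, so treat it as an assumption and not as a documented format.</p>

```python
SHOW_REF_OUTPUT = """\
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_APPLY
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_BRANCH
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/71/67904d89e116f28daf7a6e4c0878268117c893/exp-80655
f16e7b7c804cf52d91d1d11850c15963fb2a8d7b refs/exps/97/d69af70c6fb4bc59aefb9a87437dcd28b3bde4/exp-63ee0
0566d42cddb3a8c4eb533f31027f0febccbbc2dd refs/exps/91/94265d5acd847e1c439dd859aa74b1fc3d73ad/exp-bb55c
9bb067559583990a8c5d499d7435c35a7c9417b7 refs/exps/49/5c835cd36772123e82e812d96eabcce320f7ec/exp-9bf22
"""


def experiment_refs(show_ref_output):
    """Map experiment name to commit hash for five-part refs under refs/exps/."""
    found = {}
    for line in show_ref_output.splitlines():
        if not line.strip():
            continue
        commit, ref = line.split(maxsplit=1)
        parts = ref.split("/")
        # Assumed layout: refs/exps/XX/REST/NAME. Shorter refs such as
        # refs/exps/exec/EXEC_APPLY are skipped.
        if len(parts) != 5 or parts[:2] != ["refs", "exps"]:
            continue
        found[parts[-1]] = commit
    return found


refs = experiment_refs(SHOW_REF_OUTPUT)
for name in sorted(refs):
    print(name, refs[name])
```

<p>The two middle path components are treated as opaque here; the sketch only
relies on the experiment name being the last component.</p>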
<p>The best experiment can be promoted to the workspace and committed to Git.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp apply</span> exp-bb55c
</span><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> <span class="token builtin class-name">.</span>
</span><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span> <span class="token parameter variable">-m</span> <span class="token string">'optimize max feature size'</span></span></code></pre></div>
<p>Alternatively, an experiment can be promoted to a branch (<code>big_fr_size</code> branch
in this case):</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp branch</span> exp-80655 big_fr_size
</span>Git branch 'big_fr_size' has been created from experiment 'exp-80655'.
To switch to the new branch run:
git checkout big_fr_size</code></pre></div>
<p>Remove all the experiments that were not used:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp gc</span> <span class="token parameter variable">--workspace</span> <span class="token parameter variable">--force</span></span></code></pre></div>
<h2 id="ml-model-checkpoints-versioning" style="position:relative;">ML model checkpoints versioning<a href="#ml-model-checkpoints-versioning" aria-label="ml model checkpoints versioning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>ML model checkpoints are an essential part of deep learning. ML engineers prefer
to save the model files (or weights) at checkpoints during a training process
and return to them when metrics start diverging or learning is not fast enough.</p>
<p>Checkpoints change the dynamics of the ML modeling process and need
special support from the toolset:</p>
<ol>
<li>Track and save model checkpoints (DVC outputs) periodically, not only the
final result or training epoch.</li>
<li>Save metrics corresponding to each of the checkpoints.</li>
<li>Reuse checkpoints - warm-start training with an existing model file,
corresponding code, dataset version and metrics.</li>
</ol>
<p>This new behavior is supported in DVC 2.0. Now, DVC can version all your
checkpoints with corresponding code and data. It brings the reproducibility of
DL processes to the next level - every checkpoint is reproducible.</p>
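<p>The three requirements above can be sketched as a toy training loop in Python. This is an illustration under stated assumptions, not DVC's checkpoint API: the <code>train</code> function, the JSON "model" file and the fake optimization step are all hypothetical stand-ins for a real training script.</p>

```python
import json
import os


def train(checkpoint_path: str, metrics_path: str, epochs: int) -> dict:
    """Run `epochs` more epochs, resuming from `checkpoint_path` if it exists."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # warm start from the existing checkpoint
    else:
        state = {"step": -1, "weight": 0.0}  # cold start

    for _ in range(epochs):
        state["step"] += 1
        # Stand-in for a real optimization step: move the weight halfway to 1.0.
        state["weight"] += 0.5 * (1.0 - state["weight"])
        loss = (1.0 - state["weight"]) ** 2

        # Save the checkpoint and its metrics together after every epoch.
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
        with open(metrics_path, "w") as f:
            json.dump({"step": state["step"], "loss": loss}, f)
    return state
```

<p>Calling <code>train</code> twice with the same paths continues from the last saved step instead of starting over, which is the behavior a checkpoint-aware tool has to preserve between runs.</p>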
<p>This is how you define checkpoints with live-metrics:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc stage add</span> <span class="token parameter variable">-n</span> train <span class="token punctuation">\</span>
<span class="token parameter variable">-d</span> users.csv <span class="token parameter variable">-d</span> train.py <span class="token punctuation">\</span>
<span class="token parameter variable">-p</span> dropout,epochs,lr,process <span class="token punctuation">\</span>
<span class="token parameter variable">--checkpoint</span> model.h5 <span class="token punctuation">\</span>
<span class="token parameter variable">--live</span> logs <span class="token punctuation">\</span>
python train.py
</span>
Creating 'dvc.yaml'
Adding stage 'train' in 'dvc.yaml'</code></pre></div>
<p>Note that we use the <a href="https://dvc.org/doc/command-reference/stage/add"><code>dvc stage add</code></a> command instead of <code>dvc run</code>. Starting with DVC 2.0,
we are moving all stage-specific functionality under the <a href="https://dvc.org/doc/command-reference/stage"><code>dvc stage</code></a> umbrella.
<code>dvc run</code> still works, but it will be deprecated in the next major DVC
version (most likely 3.0).</p>
<p>Start the training process and interrupt it after 5 epochs:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span>
</span>'users.csv.dvc' didn't change, skipping
Running stage 'train':
> python train.py
...
^CTraceback (most recent call last):
...
KeyboardInterrupt</code></pre></div>
<p>Navigate through the checkpoints:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-pager</span> <span class="token parameter variable">--no-timestamp</span></span></code></pre></div>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>accuracy<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>epochs<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span>
</span> ──────────────────────────────────────────────────────────────────────
<span class="token rows"> workspace 4 2.0702 0.30388 2.025 … 5 …
master - - - - … 5 …
│ ╓ exp-e15bc 4 2.0702 0.30388 2.025 … 5 …
│ ╟ 5ea8327 4 2.0702 0.30388 2.025 … 5 …
│ ╟ bc0cf02 3 2.1338 0.23988 2.0883 … 5 …
│ ╟ f8cf03f 2 2.1989 0.17932 2.1542 … 5 …
│ ╟ 7575a44 1 2.2694 0.12833 2.223 … 5 …
├─╨ a72c526 0 2.3416 0.0959 2.2955 … 5 …
</span> ──────────────────────────────────────────────────────────────────────</code></pre></div>
<p>Each of the checkpoints above is a separate experiment with all data, code,
parameters and metrics. You can use the same <a href="https://dvc.org/doc/command-reference/exp/apply"><code>dvc exp apply</code></a> command to extract
any of these.</p>
<p>Another run continues this process. DVC does not remove the model/checkpoint,
so the training code trains on top of it and the accuracy metric keeps
increasing:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span>
</span>Existing checkpoint experiment 'exp-e15bc' will be resumed
...
^C
KeyboardInterrupt
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-pager</span> <span class="token parameter variable">--no-timestamp</span></span></code></pre></div>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>accuracy<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>epochs<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span>
</span> ──────────────────────────────────────────────────────────────────────
<span class="token rows"> workspace 9 1.7845 0.58125 1.7381 … 5 …
master - - - - … 5 …
│ ╓ exp-e15bc 9 1.7845 0.58125 1.7381 … 5 …
│ ╟ 205a8d3 9 1.7845 0.58125 1.7381 … 5 …
│ ╟ dd23d96 8 1.8369 0.54173 1.7919 … 5 …
│ ╟ 5bb3a1f 7 1.8929 0.49108 1.8474 … 5 …
│ ╟ 6dc5610 6 1.951 0.43433 1.9046 … 5 …
│ ╟ a79cf29 5 2.0088 0.36837 1.9637 … 5 …
│ ╟ 5ea8327 4 2.0702 0.30388 2.025 … 5 …
│ ╟ bc0cf02 3 2.1338 0.23988 2.0883 … 5 …
│ ╟ f8cf03f 2 2.1989 0.17932 2.1542 … 5 …
│ ╟ 7575a44 1 2.2694 0.12833 2.223 … 5 …
├─╨ a72c526 0 2.3416 0.0959 2.2955 … 5 …
</span> ──────────────────────────────────────────────────────────────────────</code></pre></div>
<p>After modifying the code, data, or params, the same process can be resumed. DVC
recognizes the change and shows it (see experiment <code>b363267</code>):</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">vi</span> train.py <span class="token comment"># modify code</span>
</span><span class="token line"><span class="token input">$ </span><span class="token command">vi</span> params.yaml <span class="token comment"># modify params</span>
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span>
</span>Modified checkpoint experiment based on 'exp-e15bc' will be created
...
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-pager</span> <span class="token parameter variable">--no-timestamp</span></span></code></pre></div>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>accuracy<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>epochs<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span>
</span> ──────────────────────────────────────────────────────────────────────────────
<span class="token rows"> workspace 13 1.5841 0.69262 1.5381 … 15 …
master - - - - … 5 …
│ ╓ exp-7ff06 13 1.5841 0.69262 1.5381 … 15 …
│ ╟ 6c62fec 12 1.6325 0.67248 1.5857 … 15 …
│ ╟ 4baca3c 11 1.6817 0.64855 1.6349 … 15 …
│ ╟ b363267 (2b06de7) 10 1.7323 0.61925 1.6857 … 15 …
│ ╓ 2b06de7 9 1.7845 0.58125 1.7381 … 5 …
│ ╟ 205a8d3 9 1.7845 0.58125 1.7381 … 5 …
│ ╟ dd23d96 8 1.8369 0.54173 1.7919 … 5 …
│ ╟ 5bb3a1f 7 1.8929 0.49108 1.8474 … 5 …
│ ╟ 6dc5610 6 1.951 0.43433 1.9046 … 5 …
│ ╟ a79cf29 5 2.0088 0.36837 1.9637 … 5 …
│ ╟ 5ea8327 4 2.0702 0.30388 2.025 … 5 …
│ ╟ bc0cf02 3 2.1338 0.23988 2.0883 … 5 …
│ ╟ f8cf03f 2 2.1989 0.17932 2.1542 … 5 …
│ ╟ 7575a44 1 2.2694 0.12833 2.223 … 5 …
├─╨ a72c526 0 2.3416 0.0959 2.2955 … 5 …
</span> ──────────────────────────────────────────────────────────────────────────────</code></pre></div>
<p>Sometimes you might need to train the model from scratch. The reset option
removes the checkpoint file before training: <a href="https://dvc.org/doc/command-reference/exp/run#--reset"><code>dvc exp run --reset</code></a>.</p>
<h2 id="metrics-logging" style="position:relative;">Metrics logging<a href="#metrics-logging" aria-label="metrics logging permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Continuously logging ML metrics is a very common practice in the ML world.
Instead of a simple command-line output with the metrics values, many ML
engineers prefer visuals and plots. These plots can be organized in a "database"
of ML experiments to keep track of a project. There are many specialized solutions
for metrics collection and experiment tracking, such as Sacred, MLflow, Weights
&amp; Biases, neptune.ai, and others.</p>
<p>With DVC 2.0, we are releasing a new open-source library,
<a href="https://github.com/iterative/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVC-Live</a>, that provides functionality for
tracking model metrics and organizing them in simple text files in a way that
DVC can visualize them while navigating Git history. So, DVC can show
you the metrics difference between the current model and a model in <code>master</code> or
any other branch.</p>
<p>This approach is similar to other metrics tracking tools, with the difference
that Git becomes the "database" of ML experiments.</p>
<h3 id="generate-metrics-file" style="position:relative;">Generate metrics file<a href="#generate-metrics-file" aria-label="generate metrics file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Install the library:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> dvclive</span></code></pre></div>
<p>Instrument your code:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> dvclive
<span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>keras <span class="token keyword">import</span> DvcLiveCallback
dvclive<span class="token punctuation">.</span>init<span class="token punctuation">(</span><span class="token string">"logs"</span><span class="token punctuation">)</span> <span class="token comment">#, summarize=True)</span>
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
model<span class="token punctuation">.</span>fit<span class="token punctuation">(</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
<span class="token comment"># Set up DVC-Live callback:</span>
callbacks<span class="token operator">=</span><span class="token punctuation">[</span> DvcLiveCallback<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">]</span>
<span class="token punctuation">)</span>
</code></pre></div>
<p>During the training you will see the metrics files, which are continuously
populated every epoch:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">ls</span> logs/
</span>accuracy.tsv loss.tsv val_accuracy.tsv val_loss.tsv
<span class="token line"><span class="token input">$ </span><span class="token command">head</span> logs/accuracy.tsv
</span>timestamp step accuracy
1613645582716 0 0.7360000014305115
1613645585478 1 0.8349999785423279
1613645587322 2 0.8830000162124634
1613645589125 3 0.9049999713897705
1613645590891 4 0.9070000052452087
1613645592681 5 0.9279999732971191
1613645594490 6 0.9430000185966492
1613645596232 7 0.9369999766349792
1613645598034 8 0.9430000185966492</code></pre></div>
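<p>Because these are plain text files, they are easy to post-process. As a small illustration, the Python sketch below finds the step with the highest value of a metric. The <code>best_step</code> helper is a hypothetical name, and the sketch assumes the file is tab-separated with <code>timestamp</code>, <code>step</code> and metric columns, as the <code>.tsv</code> extension and the header above suggest.</p>

```python
import csv
import io
from typing import Tuple


def best_step(tsv_text: str, metric: str) -> Tuple[int, float]:
    """Return (step, value) for the highest value of `metric` in a metrics TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(int(row["step"]), float(row[metric])) for row in reader]
    # max() keeps the first row it sees among ties, i.e. the earliest step.
    return max(rows, key=lambda pair: pair[1])
```

<p>For a file on disk, you could pass the result of <code>open("logs/accuracy.tsv").read()</code> as <code>tsv_text</code> and <code>"accuracy"</code> as <code>metric</code>.</p>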
<p>In addition to the continuous metrics files, you will see the summary metrics
file and HTML file with the same file prefix. The summary file contains the
result of the latest epoch:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> logs.json <span class="token operator">|</span> python <span class="token parameter variable">-m</span> json.tool
</span>{
"step": 41,
"loss": 0.015958430245518684,
"accuracy": 0.9950000047683716,
"val_loss": 13.705962181091309,
"val_accuracy": 0.5149999856948853
}</code></pre></div>
<p>The HTML file contains all the visuals for continuous metrics as well as the
summary metrics on a single page:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b66f0f1e2076cdf2661acb4f621e7255/39600/dvclive-html.png" alt="dvclive html" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Note that the HTML and summary metrics files are generated automatically for
each epoch, so you can monitor model performance in real time.</p>
<h3 id="git-navigation-with-the-metrics-file" style="position:relative;">Git-navigation with the metrics file<a href="#git-navigation-with-the-metrics-file" aria-label="git navigation with the metrics file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>A DVC repository is NOT required to use the live metrics functionality described
above. It works independently of DVC.</p>
<p>A DVC repository becomes useful when the metrics and plots are committed to your
Git repository and you need to navigate around the metrics.</p>
<p>Metrics difference between workspace and the last Git commit:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git status</span> <span class="token parameter variable">-s</span>
</span> M logs.json
M logs/accuracy.tsv
M logs/loss.tsv
M logs/val_accuracy.tsv
M logs/val_loss.tsv
M train.py
?? model.h5
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc metrics diff</span> <span class="token parameter variable">--target</span> logs.json
</span>Path Metric Old New Change
logs.json accuracy 0.995 0.99 -0.005
logs.json loss 0.01596 0.03036 0.0144
logs.json step 41 36 -5
logs.json val_accuracy 0.515 0.5175 0.0025
logs.json val_loss 13.70596 3.29033 -10.41563</code></pre></div>
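<p>The <code>Change</code> column is simply the new value minus the old one for each metric. A rough Python equivalent, using a hypothetical <code>metrics_diff</code> helper and assuming two flat summary dictionaries like the <code>logs.json</code> shown earlier, could look like this (the five-digit rounding is chosen to match the table, not taken from DVC's code):</p>

```python
from typing import Dict, Union

Number = Union[int, float]


def metrics_diff(
    old: Dict[str, Number], new: Dict[str, Number], precision: int = 5
) -> Dict[str, Number]:
    """Compute new - old for every metric present in both summaries."""
    return {
        name: round(new[name] - old[name], precision)
        for name in sorted(old.keys() & new.keys())
    }
```

<p>With the rounded <code>Old</code> and <code>New</code> values from the table as inputs, the differences come out as the <code>Change</code> column shows, for example <code>-5</code> for <code>step</code>.</p>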
<p>The difference between a particular commit/branch/tag or between two commits:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc metrics diff</span> <span class="token parameter variable">--target</span> logs.json HEAD^ 47b85c
</span>Path Metric Old New Change
logs.json accuracy 0.995 0.998 0.003
logs.json loss 0.01596 0.01951 0.00355
logs.json step 41 82 41
logs.json val_accuracy 0.515 0.51 -0.005
logs.json val_loss 13.70596 5.83056 -7.8754</code></pre></div>
<p>The same Git-navigation works with the plots:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> <span class="token parameter variable">--target</span> logs
</span>file:///Users/dmitry/src/exp-dc/plots.html</code></pre></div>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/cdc4ec4dabed1d7de6b8606667ebfc83/39600/dvclive-diff-html.png" alt="dvclive diff html" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Another nice thing about live metrics: they work across ML experiments and
checkpoints if properly set up in DVC stages. To set up live metrics, you need
to specify the metrics directory in the <code>live</code> section of a stage:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
<span class="token key atrule">train</span><span class="token punctuation">:</span>
<span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py
<span class="token key atrule">live</span><span class="token punctuation">:</span>
<span class="token key atrule">logs</span><span class="token punctuation">:</span>
<span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span>
<span class="token key atrule">summary</span><span class="token punctuation">:</span> <span class="token boolean important">true</span>
<span class="token key atrule">report</span><span class="token punctuation">:</span> <span class="token boolean important">true</span>
<span class="token key atrule">deps</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> data</code></pre></div>
<h2 id="ml-pipelines-parameterization-and-foreach-stages" style="position:relative;">ML pipelines parameterization and foreach stages<a href="#ml-pipelines-parameterization-and-foreach-stages" aria-label="ml pipelines parameterization and foreach stages permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>After we introduced the multi-stage pipeline file <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, it was quickly
adopted by our users. The DVC team got tons of positive feedback from them,
as well as feature requests.</p>
<h3 id="pipeline-parameters-from-vars" style="position:relative;">Pipeline parameters from <code>vars</code><a href="#pipeline-parameters-from-vars" aria-label="pipeline parameters from vars permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The most requested feature was the ability to use parameters in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>. For
example, you can pass the same seed value or filename to multiple stages in
the pipeline.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">vars</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> <span class="token key atrule">train_matrix</span><span class="token punctuation">:</span> train.pkl
<span class="token punctuation">-</span> <span class="token key atrule">test_matrix</span><span class="token punctuation">:</span> test.pkl
<span class="token punctuation">-</span> <span class="token key atrule">seed</span><span class="token punctuation">:</span> <span class="token number">20210215</span>
<span class="token punctuation">...</span>
<span class="token key atrule">stages</span><span class="token punctuation">:</span>
<span class="token key atrule">process</span><span class="token punctuation">:</span>
<span class="token key atrule">cmd</span><span class="token punctuation">:</span> python process.py \
<span class="token punctuation">-</span><span class="token punctuation">-</span>seed $<span class="token punctuation">{</span>seed<span class="token punctuation">}</span> \
<span class="token punctuation">-</span><span class="token punctuation">-</span>train $<span class="token punctuation">{</span>train_matrix<span class="token punctuation">}</span> \
<span class="token punctuation">-</span><span class="token punctuation">-</span>test $<span class="token punctuation">{</span>test_matrix<span class="token punctuation">}</span>
<span class="token key atrule">outs</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> $<span class="token punctuation">{</span>test_matrix<span class="token punctuation">}</span>
<span class="token punctuation">-</span> $<span class="token punctuation">{</span>train_matrix<span class="token punctuation">}</span>
<span class="token punctuation">...</span>
<span class="token key atrule">train</span><span class="token punctuation">:</span>
<span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py $<span class="token punctuation">{</span>train_matrix<span class="token punctuation">}</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>seed $<span class="token punctuation">{</span>seed<span class="token punctuation">}</span>
<span class="token key atrule">deps</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> $<span class="token punctuation">{</span>train_matrix<span class="token punctuation">}</span></code></pre></div>
<p>It also lets you gather all the important parameters in a single
<code>vars</code> block and experiment with them. This is a natural thing to do for scenarios
like NLP, or when hyperparameter optimization happens not only in the model
training code but in the data processing as well.</p>
<h3 id="pipeline-parameters-from-params-files" style="position:relative;">Pipeline parameters from params files<a href="#pipeline-parameters-from-params-files" aria-label="pipeline parameters from params files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>It is quite common to define pipeline parameters in a config file or a
parameters file (like <code>params.yaml</code>) instead of in the pipeline file <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>
itself. These parameters defined in <code>params.yaml</code> can also be used in
<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># params.yaml</span>
<span class="token key atrule">models</span><span class="token punctuation">:</span>
<span class="token key atrule">us</span><span class="token punctuation">:</span>
<span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">10</span>
<span class="token key atrule">filename</span><span class="token punctuation">:</span> <span class="token string">'model-us.hdf5'</span></code></pre></div>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># dvc.yaml</span>
<span class="token key atrule">stages</span><span class="token punctuation">:</span>
<span class="token key atrule">build-us</span><span class="token punctuation">:</span>
<span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token punctuation">></span><span class="token punctuation">-</span>
python script.py
<span class="token punctuation">-</span><span class="token punctuation">-</span>out $<span class="token punctuation">{</span>models.us.filename<span class="token punctuation">}</span>
<span class="token punctuation">-</span><span class="token punctuation">-</span>thresh $<span class="token punctuation">{</span>models.us.thresh<span class="token punctuation">}</span>
<span class="token key atrule">outs</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> $<span class="token punctuation">{</span>models.us.filename<span class="token punctuation">}</span></code></pre></div>
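<p>Conceptually, each <code>${...}</code> placeholder is replaced by the value found at that dotted path in the parameters. The Python sketch below illustrates the idea only; it is not DVC's templating implementation, and <code>lookup</code> and <code>interpolate</code> are hypothetical helper names.</p>

```python
import re
from typing import Any, Mapping

# Matches ${some.dotted.key} and captures the key.
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def lookup(params: Mapping[str, Any], dotted_key: str) -> Any:
    """Walk nested mappings following a key such as 'models.us.filename'."""
    value: Any = params
    for part in dotted_key.split("."):
        value = value[part]  # raises KeyError for an unknown parameter
    return value


def interpolate(template: str, params: Mapping[str, Any]) -> str:
    """Replace every ${key} in `template` with the matching parameter value."""
    return _PLACEHOLDER.sub(
        lambda match: str(lookup(params, match.group(1).strip())), template
    )
```

<p>Applied to the <code>params.yaml</code> content above (loaded as a nested dictionary), the command template resolves to <code>python script.py --out model-us.hdf5 --thresh 10</code>.</p>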
<p>DVC has properly tracked params dependencies for each stage since
version 1.0. See the
<a href="https://dvc.org/doc/command-reference/run#for-displaying-and-comparing-data-science-experiments" target="_blank" rel="nofollow noopener noreferrer"><code>--params</code> option</a>
of <code>dvc run</code> for more details.</p>
<h3 id="iterating-over-params-with-foreach-stages" style="position:relative;">Iterating over params with foreach stages<a href="#iterating-over-params-with-foreach-stages" aria-label="iterating over params with foreach stages permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Iterating over params was a frequently requested feature. Now users can define
multiple similar stages with a templatized command.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
<span class="token key atrule">build</span><span class="token punctuation">:</span>
<span class="token key atrule">foreach</span><span class="token punctuation">:</span>
<span class="token key atrule">gb</span><span class="token punctuation">:</span>
<span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">15</span>
<span class="token key atrule">filename</span><span class="token punctuation">:</span> <span class="token string">'model-gb.hdf5'</span>
<span class="token key atrule">us</span><span class="token punctuation">:</span>
<span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">10</span>
<span class="token key atrule">filename</span><span class="token punctuation">:</span> <span class="token string">'model-us.hdf5'</span>
<span class="token key atrule">do</span><span class="token punctuation">:</span>
<span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token punctuation">></span><span class="token punctuation">-</span>
python script.py <span class="token punctuation">-</span><span class="token punctuation">-</span>out $<span class="token punctuation">{</span>item.filename<span class="token punctuation">}</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>thresh $<span class="token punctuation">{</span>item.thresh<span class="token punctuation">}</span>
<span class="token key atrule">outs</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> $<span class="token punctuation">{</span>item.filename<span class="token punctuation">}</span></code></pre></div>
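<p>To make the expansion concrete, here is a minimal Python sketch of what the <code>foreach</code>/<code>do</code> template above resolves to. It is an illustration only, not DVC's own implementation: the <code>render</code> and <code>expand</code> helpers are hypothetical names, and the sketch assumes the generated stages are named <code>build@gb</code> and <code>build@us</code>.</p>

```python
import re

# Stand-in for the `foreach` mapping from the dvc.yaml example above.
FOREACH = {
    "gb": {"thresh": 15, "filename": "model-gb.hdf5"},
    "us": {"thresh": 10, "filename": "model-us.hdf5"},
}
CMD_TEMPLATE = "python script.py --out ${item.filename} --thresh ${item.thresh}"
OUT_TEMPLATE = "${item.filename}"

# Matches references of the form ${item.<field>}.
_ITEM_REF = re.compile(r"\$\{item\.(\w+)\}")


def render(template, item):
    """Replace every ${item.<field>} with the matching value from `item`."""
    return _ITEM_REF.sub(lambda m: str(item[m.group(1)]), template)


def expand(stage_name, foreach, cmd_template, out_template):
    """Return one concrete stage per foreach key, named '<stage>@<key>'."""
    return {
        f"{stage_name}@{key}": {
            "cmd": render(cmd_template, item),
            "outs": [render(out_template, item)],
        }
        for key, item in foreach.items()
    }


stages = expand("build", FOREACH, CMD_TEMPLATE, OUT_TEMPLATE)
for name, stage in stages.items():
    print(name, "->", stage["cmd"])
```

<p>Each key under <code>foreach</code> yields its own stage with its own command and output, so the two models can be built and cached independently.</p>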
<h2 id="new-method-to-provision-cloud-compute-in-new-cml-release" style="position:relative;">New method to provision cloud compute in new CML release<a href="#new-method-to-provision-cloud-compute-in-new-cml-release" aria-label="new method to provision cloud compute in new cml release permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We are releasing CML 0.3 together with DVC 2.0. It introduces a brand-new
command, <code>cml runner</code>, that hides much of the complexity of configuring
and provisioning an instance, keeping your workflows free of bash scripting
clutter.</p>
<p>The new approach uses our new
<a href="https://github.com/iterative/terraform-provider-iterative" target="_blank" rel="nofollow noopener noreferrer">Iterative Terraform Provider</a>
under the hood instead of Docker Machine, as in the first version of CML.</p>
<p>This example workflow launches an EC2 instance from a GitHub Actions workflow
and then trains a model. We hope you'll agree it's shorter, sweeter, and more
powerful than ever!</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">name</span><span class="token punctuation">:</span> <span class="token string">'Train in the cloud'</span>
<span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>push<span class="token punctuation">]</span>
<span class="token key atrule">jobs</span><span class="token punctuation">:</span>
<span class="token key atrule">deploy-runner</span><span class="token punctuation">:</span>
<span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>ubuntu<span class="token punctuation">-</span>latest<span class="token punctuation">]</span>
<span class="token key atrule">steps</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1
<span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
<span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> deploy
<span class="token key atrule">shell</span><span class="token punctuation">:</span> bash
<span class="token key atrule">env</span><span class="token punctuation">:</span>
<span class="token key atrule">repo_token</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span>
<span class="token key atrule">AWS_ACCESS_KEY_ID</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_ACCESS_KEY_ID <span class="token punctuation">}</span><span class="token punctuation">}</span>
<span class="token key atrule">AWS_SECRET_ACCESS_KEY</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_SECRET_ACCESS_KEY <span class="token punctuation">}</span><span class="token punctuation">}</span>
<span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
cml runner \
--cloud aws \
--cloud-region us-west \
--cloud-type=t2.micro \
--labels=cml-runner</span>
<span class="token key atrule">train-model</span><span class="token punctuation">:</span>
<span class="token key atrule">needs</span><span class="token punctuation">:</span> deploy<span class="token punctuation">-</span>runner
<span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">-</span>runner<span class="token punctuation">]</span>
<span class="token key atrule">steps</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
<span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>python@v2
<span class="token key atrule">with</span><span class="token punctuation">:</span>
<span class="token key atrule">python-version</span><span class="token punctuation">:</span> <span class="token string">'3.x'</span>
<span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> <span class="token string">'Train my model'</span>
<span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
pip install -r requirements.txt
python train.py</span></code></pre></div>
<p>You'll get a pull request that looks something like this:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c06746a683bc64bdcbde8464ca728656/39600/sample_pr.png" alt="sample pr" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>All the code to replicate this example is up on a
<a href="https://github.com/iterative/cml-runner-base-case" target="_blank" rel="nofollow noopener noreferrer">brand new demo repository</a>.</p>
<p>Please find more details in the
<a href="https://dvc.org/blog/cml-runner-prerelease" target="_blank" rel="nofollow noopener noreferrer">CML 0.3 pre-release blog post</a> or
in the <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML website</a>.</p>
<h2 id="github-actions-in-new-cml-release" style="position:relative;">GitHub Actions in new CML release<a href="#github-actions-in-new-cml-release" aria-label="github actions in new cml release permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>One more thing: you might've noticed in our example workflow above that there's
a <a href="https://github.com/iterative/setup-cml" target="_blank" rel="nofollow noopener noreferrer">new CML GitHub Action</a>! The new
Action helps you set up CML, giving you one more way to mix and match the CML
suite of functions with your preferred environment.</p>
<p>The new Action is designed to be a straightforward, all-in-one install that
gives you immediate use of functions like <code>cml publish</code> and <code>cml runner</code>. You'll
add this step to your workflow:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">steps</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
<span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1</code></pre></div>
<p><a href="https://github.com/iterative/setup-cml" target="_blank" rel="nofollow noopener noreferrer">More details are in the docs!</a></p>
<p>In the same way, you can reference DVC as a GitHub Action:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">steps</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
<span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>dvc@v1</code></pre></div>
<p><a href="https://github.com/iterative/setup-dvc" target="_blank" rel="nofollow noopener noreferrer">See DVC GitHub Action</a></p>
<h2 id="breaking-changes" style="position:relative;">Breaking changes<a href="#breaking-changes" aria-label="breaking changes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We put a lot of effort into keeping the number of breaking changes in this
release to a minimum, to simplify migration to the new version:</p>
<ol>
<li>Dropped support for external outputs in Google Cloud Storage and changed the
default checksum from md5 to etag.</li>
<li>Dropped support for login with p12 files on service authentication for Google
Drive.</li>
<li>Stages without dependencies are no longer always treated as changed. To
re-run such a stage every time, use <code>--always-changed</code>.</li>
<li>Environment variables inside the cmd of a stage using <code>${VAR}</code> syntax must be
escaped as <code>\${VAR}</code> in 2.0 due to the use of <code>${}</code> syntax for templating.</li>
</ol>
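<p>The escaping rule in item 4 can be illustrated with a small Python sketch. This is not DVC's parser. The <code>interpolate</code> helper, the <code>epochs</code> parameter, and the <code>OUTPUT_DIR</code> variable are hypothetical. The sketch only shows the idea: an unescaped <code>${...}</code> is filled in from params, while an escaped one is passed through for the shell to expand.</p>

```python
import re

# An optional leading backslash, followed by a ${name} reference.
_REF = re.compile(r"(\\)?\$\{([\w.]+)\}")


def interpolate(cmd, params):
    """Fill ${name} from `params`; an escaped reference becomes a literal ${name}."""

    def replace(match):
        escaped, name = match.group(1), match.group(2)
        if escaped:
            # Escaped: drop the backslash and leave the reference for the shell.
            return "${" + name + "}"
        return str(params[name])

    return _REF.sub(replace, cmd)


rendered = interpolate(
    r"python train.py --epochs ${epochs} --out \${OUTPUT_DIR}",
    {"epochs": 5},
)
print(rendered)
```

<p>Without the backslash, <code>${OUTPUT_DIR}</code> would be looked up as a templating parameter instead of an environment variable, which is why existing commands need the escape after upgrading.</p>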
<h2 id="thank-you" style="position:relative;">Thank you!<a href="#thank-you" aria-label="thank you permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Thank you to all DVC users and community members for the help. Please try out
the new DVC and CML releases and do not get lost in your ML experiments!</p>https://dvc.org/blog/february-21-community-gemshttps://dvc.org/blog/february-21-community-gemsFri, 26 Feb 2021 00:00:00 GMT<h2 id="dvc-questions" style="position:relative;">DVC Questions<a href="#dvc-questions" aria-label="dvc questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="q-i-noticed-i-have-a-dvc-config-file-and-a-configlocal-file-whats-best-practice-for-committing-these-to-my-git-repository" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/666708671333400599" target="_blank" rel="nofollow noopener noreferrer">Q: I noticed I have a DVC <code>config</code> file and a <code>config.local</code> file. What's best practice for committing these to my Git repository?</a><a href="#q-i-noticed-i-have-a-dvc-config-file-and-a-configlocal-file-whats-best-practice-for-committing-these-to-my-git-repository" aria-label="q i noticed i have a dvc config file and a configlocal file whats best practice for committing these to my git repository permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC uses the <code>config</code> and <code>config.local</code> files to link your remote data
repository to your project. <code>config</code> is intended to be committed to Git, while
<code>config.local</code> is not - it's a file that you use to store sensitive information
(e.g. your personal credentials - username, password, access keys, etc. for
remote storage) or settings that are specific to your local environment.</p>
<p>Usually, you don't have to worry about ensuring your <code>config.local</code> file is
being ignored by Git: the only way to create a <code>config.local</code> file is to pass the
<code>--local</code> flag explicitly to commands like <a href="https://dvc.org/doc/command-reference/remote"><code>dvc remote</code></a> and <a href="https://dvc.org/doc/command-reference/config"><code>dvc config</code></a>,
so you'll know you've made one! Your <code>config.local</code> file is also
listed in <code>.gitignore</code> by default. If you're concerned, take a look and make sure there
are no settings in your <code>config.local</code> file that you actually want in your
regular <code>config</code> file.</p>
<p>To learn more about <code>config</code> and <code>config.local</code>,
<a href="https://dvc.org/doc/command-reference/remote#example-add-a-default-local-remote" target="_blank" rel="nofollow noopener noreferrer">read up in our docs</a>.</p>
<h3 id="q-whats-the-best-way-to-install-the-new-version-of-dvc-in-a-conda-environment-im-concerned-about-the-paramiko-dependency" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/669173874247729165" target="_blank" rel="nofollow noopener noreferrer">Q: What's the best way to install the new version of DVC in a Conda environment? I'm concerned about the <code>paramiko</code> dependency.</a><a href="#q-whats-the-best-way-to-install-the-new-version-of-dvc-in-a-conda-environment-im-concerned-about-the-paramiko-dependency" aria-label="q whats the best way to install the new version of dvc in a conda environment im concerned about the paramiko dependency permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>When you install DVC via <code>conda</code>, it will come with dependencies like
<code>paramiko</code>.</p>
<p>The only exception when installing DVC as a Python library is with <code>pip</code>: you
might want to specify the kind of remote storage you need to make sure all
dependencies are present (like <code>boto</code> for S3). You can run
<code>pip install "dvc[<option>]"</code>, with supported options like <code>[s3]</code>, <code>[azure]</code>,
<code>[gdrive]</code>, <code>[gs]</code>, <code>[oss]</code>, <code>[ssh]</code>. Or, use <code>[all]</code> to include them all.</p>
<p>For more about installing DVC and its dependencies,
<a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">check out our docs</a>.</p>
<h3 id="q-how-do-i-keep-track-of-changes-in-modules-that-my-dvc-pipeline-depends-on-for-example-i-have-a-pipeline-stage-that-runs-a-script-preparepy-which-imports-a-module-modulepy-if-modulepy-changes-how-will-dvc-know-to-rerun-the-pipeline-stage" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/663952575984435220" target="_blank" rel="nofollow noopener noreferrer">Q: How do I keep track of changes in <em>modules</em> that my DVC pipeline depends on? For example, I have a pipeline stage that runs a script <code>prepare.py</code>, which imports a module <code>module.py</code>. If <code>module.py</code> changes, how will DVC know to rerun the pipeline stage?</a><a href="#q-how-do-i-keep-track-of-changes-in-modules-that-my-dvc-pipeline-depends-on-for-example-i-have-a-pipeline-stage-that-runs-a-script-preparepy-which-imports-a-module-modulepy-if-modulepy-changes-how-will-dvc-know-to-rerun-the-pipeline-stage" aria-label="q how do i keep track of changes in modules that my dvc pipeline depends on for example i have a pipeline stage that runs a script preparepy which imports a module modulepy if modulepy changes how will dvc know to rerun the pipeline stage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If your DVC pipeline only lists <code>prepare.py</code> as a dependency, then changing code
in module files won't trigger a re-run of the pipeline. If you run
<a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> after updating <code>module.py</code>, DVC will simply keep the results of
your last pipeline run and report that nothing has changed.</p>
<p>To explain further why this happens:</p>
<p>DVC is platform agnostic and it doesn't know whether your command's executable
is <code>python</code>, some other script interpreter, or a compiled binary for that
matter.</p>
<blockquote>
<p>E.g. this is a valid stage: <code>dvc run -o hello.txt 'echo "Hello!" > hello.txt'</code>
(where the executable is echo).</p>
</blockquote>
<p>DVC also doesn't know what's going on inside the command's source code.
Therefore, any file that your code requires internally should be explicitly
specified as a pipeline stage dependency (in CLI, <code>dvc run -d</code> , or in YAML,
<code>deps:</code>) for DVC to track it.</p>
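<p>A rough Python sketch of why this matters, assuming change detection boils down to comparing content hashes of the listed dependencies. The <code>fingerprint</code> helper is hypothetical and SHA-256 is used purely for illustration. This is not how DVC is implemented internally.</p>

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(deps):
    """Hash the contents of the listed dependency files, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(deps):
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    prepare = Path(tmp, "prepare.py")
    module = Path(tmp, "module.py")
    prepare.write_text("import module\n")
    module.write_text("VALUE = 1\n")

    only_script = [prepare]              # deps: prepare.py
    script_and_module = [prepare, module]  # deps: prepare.py, module.py
    before_partial = fingerprint(only_script)
    before_full = fingerprint(script_and_module)

    module.write_text("VALUE = 2\n")  # edit the imported module

    # The edit is invisible unless module.py is among the listed dependencies.
    missed_change = fingerprint(only_script) == before_partial
    caught_change = fingerprint(script_and_module) != before_full

print("change missed when only prepare.py is listed:", missed_change)
print("change caught when module.py is listed too:", caught_change)
```

<p>Only files that take part in the fingerprint can invalidate the stage, which is the reason every internally required file has to be declared.</p>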
<p>If you're not interested in adding modules as explicit dependencies, there are a
few other approaches:</p>
<ul>
<li>Make your <code>requirements.txt</code> file a stage dependency (if the loaded module
comes from a package).</li>
<li>Manually rebuild the pipeline (with <a href="https://dvc.org/doc/command-reference/repro#--force"><code>dvc repro --force <stage>.dvc</code></a>) when you
know an unmarked dependency is changed – although this is prone to human
error.</li>
<li>Have a version/build number comment in the main script that always gets
updated when an unmarked dependency changes – this could be automated.</li>
</ul>
<p><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/658501655641325580" target="_blank" rel="nofollow noopener noreferrer">See here for more information on similar use cases.</a></p>
<p>We also have an ongoing discussion about this issue on our GitHub repository,
and we'd love your input.
<a href="https://github.com/iterative/dvc/issues/1577#issuecomment-568391709" target="_blank" rel="nofollow noopener noreferrer">Please join the discussion in this issue if you can!</a></p>
<h3 id="q-my-dvc-pipeline-has-a-lot-of-dependencies-and-i-dont-want-to-manually-write-them-all-out-in-my-dvcyaml-file-are-there-any-ways-to-use-wildcards-like--or-specify-directories-as-dependencies" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/803961071135883294" target="_blank" rel="nofollow noopener noreferrer">Q: My DVC pipeline has <em>a lot</em> of dependencies, and I don't want to manually write them all out in my <code>dvc.yaml</code> file. Are there any ways to use wildcards (like <code>*</code>) or specify directories as dependencies?</a><a href="#q-my-dvc-pipeline-has-a-lot-of-dependencies-and-i-dont-want-to-manually-write-them-all-out-in-my-dvcyaml-file-are-there-any-ways-to-use-wildcards-like--or-specify-directories-as-dependencies" aria-label="q my dvc pipeline has a lot of dependencies and i dont want to manually write them all out in my dvcyaml file are there any ways to use wildcards like or specify directories as dependencies permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes, you can set a directory to be a dependency or an output of a DVC pipeline
stage. This means you can have tens, hundreds, thousands or millions of
dependency files in one directory, and all you have to declare in the pipeline
is the address of that directory.</p>
<p><a href="https://dvc.org/doc/command-reference/run#options" target="_blank" rel="nofollow noopener noreferrer">Check out all the options here.</a></p>
<h2 id="cml-questions" style="position:relative;">CML Questions<a href="#cml-questions" aria-label="cml questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="q-i-heard-theres-a-new-cml-feature-using-terraform-to-provision-runners-when-is-this-coming-out" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/812069229473562624" target="_blank" rel="nofollow noopener noreferrer">Q: I heard there's a new CML feature using Terraform to provision runners. When is this coming out?</a><a href="#q-i-heard-theres-a-new-cml-feature-using-terraform-to-provision-runners-when-is-this-coming-out" aria-label="q i heard theres a new cml feature using terraform to provision runners when is this coming out permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You're in luck, because we just shared this feature as part of the CML 0.3.0
pre-release! The pre-release introduced a new function, <code>cml runner</code>, which
upgraded our
<a href="https://github.com/iterative/cml_cloud_case/blob/b76aba13791ce18c5715f464f58877ffa10d4cfa/.github/workflows/cml.yaml" target="_blank" rel="nofollow noopener noreferrer">previous method for launching instances in the cloud from a CI workflow using Docker Machine</a>.
In the new <code>cml runner</code> function built on Terraform, you can deploy instances in
AWS and Azure with a single command (it used to take about 30 lines of code!).
For example, to launch a <code>t2.micro</code> instance on AWS from your GitHub Actions or
GitLab CI workflow, you'll run:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">cml runner <span class="token punctuation">\</span>
<span class="token parameter variable">--cloud</span> aws <span class="token punctuation">\</span>
--cloud-region us-west <span class="token punctuation">\</span>
--cloud-type<span class="token operator">=</span>t2.micro <span class="token punctuation">\</span>
<span class="token parameter variable">--labels</span><span class="token operator">=</span>cml-runner</code></pre></div>
<p>Check out the <a href="https://dvc.org/blog/cml-runner-prerelease" target="_blank" rel="nofollow noopener noreferrer">pre-release notes</a>
and our
<a href="https://github.com/iterative/cml-runner-base-case" target="_blank" rel="nofollow noopener noreferrer">example project repository</a>
to get started.</p>
<h3 id="q-my-ci-workflow-creates-a-reportmdhttpreportmd-document-that-gets-published-to-my-pull-request-by-cml-i-want-to-save-the-reportmd-file-to-my-repository-too-is-this-possible" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/810946119374340127" target="_blank" rel="nofollow noopener noreferrer">Q: My CI workflow creates a <code>report.md</code> document that gets published to my pull request by CML. I want to save the <code>report.md</code> file to my repository, too. Is this possible?</a><a href="#q-my-ci-workflow-creates-a-reportmdhttpreportmd-document-that-gets-published-to-my-pull-request-by-cml-i-want-to-save-the-reportmd-file-to-my-repository-too-is-this-possible" aria-label="q my ci workflow creates a reportmdhttpreportmd document that gets published to my pull request by cml i want to save the reportmd file to my repository too is this possible permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>By default, files that are created in a GitHub Actions or GitLab CI workflow
only exist on the runner: as soon as the runner shuts down, they vanish.
Functions like <code>cml publish</code> and <code>cml send-comment</code> create persistent links to
data visualizations, tables, and other outputs of your workflow so you can view
them long after your run ends. However, by design, CML doesn't commit files to
your repository (not all users want this!).</p>
<p>What you're likely looking for is an auto-commit, to essentially <code>git add</code> and
<code>git commit</code> files generated by the workflow to your repository. You can
manually write this code into your workflow file, or you can use a GitHub Action
tool like the
<a href="https://github.com/marketplace/actions/git-auto-commit" target="_blank" rel="nofollow noopener noreferrer">Auto Commit</a> or
<a href="https://github.com/marketplace/actions/add-commit" target="_blank" rel="nofollow noopener noreferrer">Add & Commit</a> Actions.</p>
<h3 id="q-do-you-have-any-suggested-caching-strategies-with-cml-and-dvc-my-dvc-pipeline-runs-in-a-ci-workflow-and-it-depends-on-15-gb-of-data-i-dont-want-to-download-this-dataset-to-my-runner-every-time-the-workflow-runs" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/812059539696386079" target="_blank" rel="nofollow noopener noreferrer">Q: Do you have any suggested caching strategies with CML and DVC? My DVC pipeline runs in a CI workflow, and it depends on ~15 GB of data. I don't want to download this dataset to my runner every time the workflow runs.</a><a href="#q-do-you-have-any-suggested-caching-strategies-with-cml-and-dvc-my-dvc-pipeline-runs-in-a-ci-workflow-and-it-depends-on-15-gb-of-data-i-dont-want-to-download-this-dataset-to-my-runner-every-time-the-workflow-runs" aria-label="q do you have any suggested caching strategies with cml and dvc my dvc pipeline runs in a ci workflow and it depends on 15 gb of data i dont want to download this dataset to my runner every time the workflow runs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Downloading data to a runner on every CI workflow can be needlessly time
consuming, particularly when the data rarely changes.</p>
<p>While we don't have a CML-specific mechanism in the works for this use case,
there are two main approaches we see as viable:</p>
<ol>
<li><strong>Attach an EBS volume</strong> to the instance that runs your workflow. If you're
using DVC, DVC needs to run in that volume (at the very least, your DVC cache
must be there). A user
<a href="https://discord.com/channels/485586884165107732/728693131557732403/812059539696386079" target="_blank" rel="nofollow noopener noreferrer">recently let us know</a>
that this approach is working well for them and prevents unnecessary
re-downloads of their DVC cache. They also
<a href="https://towardsdatascience.com/stop-duplicating-deep-learning-training-datasets-with-amazon-ebs-multi-attach-d9f61fdc1de4" target="_blank" rel="nofollow noopener noreferrer">recommended this article</a>
for setup guidelines.</li>
<li><strong>Use a shared DVC cache.</strong> Many DVC users already keep their cache
on a shared <a href="https://en.wikipedia.org/wiki/Network_File_System" target="_blank" rel="nofollow noopener noreferrer">NFS</a> mount. A similar
setup that might help here is a single shared development server:
<a href="https://dvc.org/doc/use-cases/fast-data-caching-hub#example-shared-development-server" target="_blank" rel="nofollow noopener noreferrer">check out our docs for a use case</a>.</li>
</ol>
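<p>As a rough sketch of the first approach in a GitHub Actions job: assuming the
volume is mounted on a self-hosted runner at a hypothetical path
<code>/mnt/dvc-cache</code>, a step can point DVC at that directory before pulling, so
only objects missing from the cache are downloaded. The job name, runner label,
and mount path below are illustrative assumptions, not something CML sets up for
you.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml">jobs:
  train:
    runs-on: [self-hosted, cml-runner]
    steps:
      - uses: actions/checkout@v2
      - name: Reuse the persistent DVC cache
        run: |
          # /mnt/dvc-cache is a hypothetical mount point for the attached volume
          dvc cache dir /mnt/dvc-cache
          # Link files from the cache into the workspace instead of copying them
          dvc config cache.type symlink,copy
          # Only objects that are not already in the cache are fetched
          dvc pull</code></pre></div>
<p>The same idea should carry over to a shared NFS cache: only the directory passed
to <code>dvc cache dir</code> changes.</p>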
<hr>
<p>As always, if you have any use case questions or need support, join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>! Or head to the
<a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</p>
<p>And, you can follow us on <a href="https://twitter.com/dvcorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> and
<a href="https://www.linkedin.com/company/iterative-ai" target="_blank" rel="nofollow noopener noreferrer">LinkedIn</a>!</p>https://dvc.org/blog/cml-runner-prereleasehttps://dvc.org/blog/cml-runner-prereleaseMon, 22 Feb 2021 00:00:00 GMT<p>Today, we're pre-releasing some new features in Continuous Machine Learning, or
<a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a>—our open source project to adapt popular continuous
integration (CI) systems like GitHub Actions and GitLab CI for data science. CML
has become a popular tool for auto-generating ML model reports right in a GitHub
Pull Request and orchestrating resources for training models in the cloud.</p>
<p>Here's what's in today's pre-release:</p>
<h2 id="brand-new-method-to-provision-cloud-compute-for-your-ci-workflows" style="position:relative;">Brand new method to provision cloud compute for your CI workflows<a href="#brand-new-method-to-provision-cloud-compute-for-your-ci-workflows" aria-label="brand new method to provision cloud compute for your ci workflows permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>After the initial CML release, we found ways to significantly simplify the
process of allocating resources in CI/CD. We developed a brand new CML command
<code>cml runner</code> that hides much of the complexity of configuring and provisioning
an instance, keeping your workflows free of <code>bash</code> scripting clutter (until the
official release, docs are
<a href="https://github.com/iterative/cml/blob/c2b96c461011f01ab2476e1542fb89d7229d150d/README.md" target="_blank" rel="nofollow noopener noreferrer">in development here</a>).
The new approach uses a Terraform provider under the hood instead of Docker
Machine, which the first version relied on.</p>
<p>Check out this example workflow to launch an EC2 instance from a GitHub Actions
workflow and then train a model. We hope you'll agree it's shorter, sweeter, and
more powerful than ever!</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">name</span><span class="token punctuation">:</span> <span class="token string">'Train in the cloud'</span>
<span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>push<span class="token punctuation">]</span>
<span class="token key atrule">jobs</span><span class="token punctuation">:</span>
  <span class="token key atrule">deploy-runner</span><span class="token punctuation">:</span>
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>ubuntu<span class="token punctuation">-</span>latest<span class="token punctuation">]</span>
    <span class="token key atrule">steps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> deploy
        <span class="token key atrule">shell</span><span class="token punctuation">:</span> bash
        <span class="token key atrule">env</span><span class="token punctuation">:</span>
          <span class="token key atrule">repo_token</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span>
          <span class="token key atrule">AWS_ACCESS_KEY_ID</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_ACCESS_KEY_ID <span class="token punctuation">}</span><span class="token punctuation">}</span>
          <span class="token key atrule">AWS_SECRET_ACCESS_KEY</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_SECRET_ACCESS_KEY <span class="token punctuation">}</span><span class="token punctuation">}</span>
        <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
          cml runner \
            --cloud aws \
            --cloud-region us-west \
            --cloud-type=t2.micro \
            --labels=cml-runner</span>
  <span class="token key atrule">train-model</span><span class="token punctuation">:</span>
    <span class="token key atrule">needs</span><span class="token punctuation">:</span> deploy<span class="token punctuation">-</span>runner
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">-</span>runner<span class="token punctuation">]</span>
    <span class="token key atrule">steps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>python@v2
        <span class="token key atrule">with</span><span class="token punctuation">:</span>
          <span class="token key atrule">python-version</span><span class="token punctuation">:</span> <span class="token string">'3.x'</span>
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> <span class="token string">'Train my model'</span>
        <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
          pip install -r requirements.txt
          python train.py</span></code></pre></div>
<p>If you use CML functions in the <code>train-model</code> job, you can go even further and
get a closed loop, sending model training results from the EC2 instance to your
pull request or merge request! For example, we can expand the <code>train-model</code> job
to incorporate functions like <code>cml publish</code> and <code>cml send-comment</code>:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train-model</span><span class="token punctuation">:</span>
  <span class="token key atrule">needs</span><span class="token punctuation">:</span> deploy<span class="token punctuation">-</span>runner
  <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">-</span>runner<span class="token punctuation">]</span>
  <span class="token key atrule">container</span><span class="token punctuation">:</span> docker<span class="token punctuation">:</span>//dvcorg/cml
  <span class="token key atrule">steps</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
    <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>python@v2
      <span class="token key atrule">with</span><span class="token punctuation">:</span>
        <span class="token key atrule">python-version</span><span class="token punctuation">:</span> <span class="token string">'3.x'</span>
    <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> <span class="token string">'Train a model'</span>
      <span class="token key atrule">env</span><span class="token punctuation">:</span>
        <span class="token key atrule">repo_token</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span>
      <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
        pip install -r requirements.txt
        python train.py</span>
        echo "<span class="token comment">## Report from your EC2 Instance" > report.md</span>
        cat metrics.txt <span class="token punctuation">></span><span class="token punctuation">></span> report.md
        cml publish "plot.png" <span class="token punctuation">-</span><span class="token punctuation">-</span>md <span class="token punctuation">></span><span class="token punctuation">></span> report.md
        cml send<span class="token punctuation">-</span>comment report.md</code></pre></div>
<p>You'll get a pull request that looks something like this:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c06746a683bc64bdcbde8464ca728656/39600/sample_pr.png" alt="sample pr" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>All the code to replicate this example is up on a
<a href="https://github.com/iterative/cml-runner-base-case" target="_blank" rel="nofollow noopener noreferrer">brand new demo repository</a>.</p>
<h3 id="our-favorite-details" style="position:relative;">Our favorite details<a href="#our-favorite-details" aria-label="our favorite details permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The new <code>cml runner</code> function lets you turn on instances, including GPU,
high-memory and spot instances, and kick off a new workflow using the hardware
and environment of your choice—and of course, it'll turn <em>off</em> those instances
after a configurable timeout! In the first CML release, this took
<a href="https://github.com/iterative/cml_cloud_case/blob/master/.github/workflows/cml.yaml" target="_blank" rel="nofollow noopener noreferrer">more than 30 lines of code</a>
to configure. Now it's just one function.</p>
<p>Another highlight: you can use whatever Docker container you'd like on your
instance. In the above example, we use our
<a href="https://github.com/iterative/cml/blob/master/Dockerfile" target="_blank" rel="nofollow noopener noreferrer">custom CML Docker container</a>
(because we like it!)—but you certainly don't have to! Whatever image you
choose, we highly recommend containerizing your environment for ultimate
reproducibility and security with CML.</p>
<p>You can also use the new <code>cml runner</code> function to set up a
<a href="https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners" target="_blank" rel="nofollow noopener noreferrer">local self-hosted runner</a>.
On your local machine or on-premises GPU cluster, you'll install CML as a package
and then run:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ cml runner <span class="token punctuation">\</span>
<span class="token parameter variable">--repo</span> <span class="token variable">$your_project_repository_url</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--token</span><span class="token operator">=</span><span class="token variable">$personal_access_token</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--labels</span> tf <span class="token punctuation">\</span>
--idle-timeout <span class="token number">180</span></code></pre></div>
<p>Now your machine will be listening for workflows from your project repository.</p>
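<p>To route a job to that machine, the workflow only needs to request the same
label. A minimal sketch, assuming the runner was registered with
<code>--labels tf</code> as above (the job and step names are illustrative):</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml">jobs:
  train-model:
    # "tf" matches the --labels value passed to cml runner
    runs-on: [self-hosted, tf]
    steps:
      - uses: actions/checkout@v2
      - name: 'Train my model'
        run: |
          pip install -r requirements.txt
          python train.py</code></pre></div>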
<h2 id="a-new-github-action" style="position:relative;">A New GitHub Action<a href="#a-new-github-action" aria-label="a new github action permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>One more thing: you might've noticed in our example workflow above that there's
a <a href="https://github.com/iterative/setup-cml" target="_blank" rel="nofollow noopener noreferrer">new CML GitHub Action</a>! The new
Action helps you set up CML, giving you one more way to mix and match the CML
suite of functions with your preferred environment.</p>
<p>The new Action is designed to be a straightforward, all-in-one install that
gives you immediate use of functions like <code>cml publish</code> and <code>cml runner</code>. You'll
add this step to your workflow:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">steps</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
<span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1</code></pre></div>
<p><a href="https://github.com/iterative/setup-cml" target="_blank" rel="nofollow noopener noreferrer">More details are in the docs!</a></p>
<h2 id="get-ready-for-the-release" style="position:relative;">Get ready for the release<a href="#get-ready-for-the-release" aria-label="get ready for the release permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We're inviting our community members to explore these new features in
anticipation of our upcoming, <em>official</em> release. As always, feedback is welcome
by opening an issue on the
<a href="https://github.com/iterative/cml" target="_blank" rel="nofollow noopener noreferrer">CML GitHub repository</a>, as a comment here or
via our <a href="https://discord.gg/bzA6uY7" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>. We're excited to hear
what you think!</p>https://dvc.org/blog/dvc-2-0-pre-releasehttps://dvc.org/blog/dvc-2-0-pre-releaseWed, 17 Feb 2021 00:00:00 GMT<h2 id="install" style="position:relative;">Install<a href="#install" aria-label="install permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>First things first. You can install the 2.0 pre-release from the master branch
in our repo (instructions <a href="https://dvc.org/doc/install/pre-release" target="_blank" rel="nofollow noopener noreferrer">here</a>) or
through pip:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> <span class="token parameter variable">--upgrade</span> <span class="token parameter variable">--pre</span> dvc</span></code></pre></div>
<h2 id="ml-pipelines-parameterization-and-foreach-stages" style="position:relative;">ML pipelines parameterization and foreach stages<a href="#ml-pipelines-parameterization-and-foreach-stages" aria-label="ml pipelines parameterization and foreach stages permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>After we introduced the multi-stage pipeline file <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, it was quickly
adopted by our users. The DVC team got tons of positive feedback from them,
as well as feature requests.</p>
<h3 id="pipeline-parameters-from-vars" style="position:relative;">Pipeline parameters from <code>vars</code><a href="#pipeline-parameters-from-vars" aria-label="pipeline parameters from vars permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The most requested feature was the ability to use parameters in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>. For
example, you can pass the same seed value or filename to multiple stages in
the pipeline.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">vars</span><span class="token punctuation">:</span>
  <span class="token key atrule">train_matrix</span><span class="token punctuation">:</span> train.pkl
  <span class="token key atrule">test_matrix</span><span class="token punctuation">:</span> test.pkl
  <span class="token key atrule">seed</span><span class="token punctuation">:</span> <span class="token number">20210215</span>
<span class="token punctuation">...</span>
<span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">process</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python process.py \
      <span class="token punctuation">-</span><span class="token punctuation">-</span>seed $<span class="token punctuation">{</span>seed<span class="token punctuation">}</span> \
      <span class="token punctuation">-</span><span class="token punctuation">-</span>train $<span class="token punctuation">{</span>train_matrix<span class="token punctuation">}</span> \
      <span class="token punctuation">-</span><span class="token punctuation">-</span>test $<span class="token punctuation">{</span>test_matrix<span class="token punctuation">}</span>
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> $<span class="token punctuation">{</span>test_matrix<span class="token punctuation">}</span>
      <span class="token punctuation">-</span> $<span class="token punctuation">{</span>train_matrix<span class="token punctuation">}</span>
  <span class="token punctuation">...</span>
  <span class="token key atrule">train</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py $<span class="token punctuation">{</span>train_matrix<span class="token punctuation">}</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>seed $<span class="token punctuation">{</span>seed<span class="token punctuation">}</span>
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> $<span class="token punctuation">{</span>train_matrix<span class="token punctuation">}</span></code></pre></div>
<p>It also lets you keep all the important parameters in a single
<code>vars</code> block and experiment with them. This is a natural thing to do for scenarios
like NLP or when hyperparameter optimization is happening not only in the model
training code but in the data processing as well.</p>
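<p>To make the substitution concrete: with the <code>vars</code> values from the example
above, the <code>process</code> stage should resolve to roughly the following. This is a
hand-written illustration of the result, with the command shown on one line for
readability, not output produced by DVC.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml">stages:
  process:
    # ${seed}, ${train_matrix} and ${test_matrix} replaced by their values
    cmd: python process.py --seed 20210215 --train train.pkl --test test.pkl
    outs:
      - test.pkl
      - train.pkl</code></pre></div>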
<h3 id="pipeline-parameters-from-params-files" style="position:relative;">Pipeline parameters from params files<a href="#pipeline-parameters-from-params-files" aria-label="pipeline parameters from params files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>It is quite common to define pipeline parameters in a config file or a
parameters file (like <code>params.yaml</code>) instead of in the pipeline file <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>
itself. These parameters defined in <code>params.yaml</code> can also be used in
<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># params.yaml</span>
<span class="token key atrule">models</span><span class="token punctuation">:</span>
  <span class="token key atrule">us</span><span class="token punctuation">:</span>
    <span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">10</span>
    <span class="token key atrule">filename</span><span class="token punctuation">:</span> <span class="token string">'model-us.hdf5'</span></code></pre></div>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># dvc.yaml</span>
<span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">build-us</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token punctuation">></span><span class="token punctuation">-</span>
      python script.py
      <span class="token punctuation">-</span><span class="token punctuation">-</span>out $<span class="token punctuation">{</span>models.us.filename<span class="token punctuation">}</span>
      <span class="token punctuation">-</span><span class="token punctuation">-</span>thresh $<span class="token punctuation">{</span>models.us.thresh<span class="token punctuation">}</span>
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> $<span class="token punctuation">{</span>models.us.filename<span class="token punctuation">}</span></code></pre></div>
<p>DVC has tracked params dependencies for each stage since version 1.0. See the
<a href="https://dvc.org/doc/command-reference/run#for-displaying-and-comparing-data-science-experiments" target="_blank" rel="nofollow noopener noreferrer"><code>--params</code> option</a>
of <code>dvc run</code> for more details.</p>
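<p>In <code>dvc.yaml</code> itself, a stage can declare such a dependency through its
<code>params</code> field. A minimal sketch, assuming the <code>params.yaml</code> shown above and
that the stage should be re-run whenever <code>models.us.thresh</code> changes:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"># dvc.yaml
stages:
  build-us:
    cmd: python script.py --out ${models.us.filename} --thresh ${models.us.thresh}
    params:
      # Read from params.yaml by default
      - models.us.thresh
    outs:
      - ${models.us.filename}</code></pre></div>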
<h3 id="iterating-over-params-with-foreach-stages" style="position:relative;">Iterating over params with foreach stages<a href="#iterating-over-params-with-foreach-stages" aria-label="iterating over params with foreach stages permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Iterating over params was a frequently requested feature. Now users can define
multiple similar stages with a templatized command.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">build</span><span class="token punctuation">:</span>
    <span class="token key atrule">foreach</span><span class="token punctuation">:</span>
      <span class="token key atrule">gb</span><span class="token punctuation">:</span>
        <span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">15</span>
        <span class="token key atrule">filename</span><span class="token punctuation">:</span> <span class="token string">'model-gb.hdf5'</span>
      <span class="token key atrule">us</span><span class="token punctuation">:</span>
        <span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">10</span>
        <span class="token key atrule">filename</span><span class="token punctuation">:</span> <span class="token string">'model-us.hdf5'</span>
    <span class="token key atrule">do</span><span class="token punctuation">:</span>
      <span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token punctuation">></span><span class="token punctuation">-</span>
        python script.py <span class="token punctuation">-</span><span class="token punctuation">-</span>out $<span class="token punctuation">{</span>item.filename<span class="token punctuation">}</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>thresh $<span class="token punctuation">{</span>item.thresh<span class="token punctuation">}</span>
      <span class="token key atrule">outs</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> $<span class="token punctuation">{</span>item.filename<span class="token punctuation">}</span></code></pre></div>
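<p>The template is expanded once per key under <code>foreach</code>, and the generated
stages are addressed as <code>build@gb</code> and <code>build@us</code>. The result is roughly what
you would get by writing both stages by hand, as in the sketch below. The stage
names <code>build-gb</code> and <code>build-us</code> are illustrative only, chosen to keep the
hand-written version readable.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml">stages:
  build-gb:
    # ${item.*} filled in from the "gb" entry
    cmd: python script.py --out model-gb.hdf5 --thresh 15
    outs:
      - model-gb.hdf5
  build-us:
    # ${item.*} filled in from the "us" entry
    cmd: python script.py --out model-us.hdf5 --thresh 10
    outs:
      - model-us.hdf5</code></pre></div>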
<h2 id="lightweight-ml-experiments" style="position:relative;">Lightweight ML experiments<a href="#lightweight-ml-experiments" aria-label="lightweight ml experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>DVC uses Git versioning as the basis for ML experiments. This solid foundation
makes each experiment reproducible and accessible from the project's history.
This Git-based approach works very well for ML projects with mature models when
only a few new experiments per day are run.</p>
<p>However, in more active development, when dozens or hundreds of experiments need
to be run in a single day, Git creates overhead — each experiment run requires
additional Git commands <code>git add/commit</code>, and comparing all experiments is
difficult.</p>
<p>We introduce lightweight experiments in DVC 2.0! With them, you can auto-track
ML experiments without adding any overhead for ML engineers.</p>
<p>⚠️ Note, our new ML experiment features (<a href="https://dvc.org/doc/command-reference/exp"><code>dvc exp</code></a>) are experimental in the
coming release. This means that the commands might change a bit in the following
minor releases.</p>
<p><a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> can run an ML experiment with a new hyperparameter from
<code>params.yaml</code> while <a href="https://dvc.org/doc/command-reference/exp/diff"><code>dvc exp diff</code></a> shows metrics and params difference:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.max_features</span><span class="token operator">=</span><span class="token number">3000</span>
</span>
Reproduced experiment(s): exp-bb55c
Experiment results have been applied to your workspace.
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp diff</span>
</span>Path         Metric                  Value    Change
scores.json  auc                     0.57462  0.0072197
Path         Param                   Value    Change
params.yaml  featurize.max_features  3000     1500</code></pre></div>
<p>More experiments:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.max_features</span><span class="token operator">=</span><span class="token number">4000</span>
</span>Reproduced experiment(s): exp-9bf22
Experiment results have been applied to your workspace.
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.max_features</span><span class="token operator">=</span><span class="token number">5000</span>
</span>Reproduced experiment(s): exp-63ee0
Experiment results have been applied to your workspace.
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.max_features</span><span class="token operator">=</span><span class="token number">5000</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.ngrams</span><span class="token operator">=</span><span class="token number">3</span>
</span>Reproduced experiment(s): exp-80655
Experiment results have been applied to your workspace.</code></pre></div>
<p>In the examples above, hyperparameters were changed with the <code>--set-param</code>
option, but you can make the same changes by editing the params file instead. In
fact, <em>any code or data files can be changed</em> and <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> will capture the
variations.</p>
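<p>When you want to try several values at once, the command lines can be generated instead of typed. The sketch below is a minimal illustration, not part of DVC: the helper <code>build_exp_commands</code> is a hypothetical name, and it only builds the <code>dvc exp run --set-param</code> strings shown above without executing them.</p>

```python
import itertools
import shlex


def build_exp_commands(grid):
    """Return one `dvc exp run` command line per combination in `grid`.

    `grid` maps a parameter name (as it appears in params.yaml) to a list
    of candidate values. This helper is hypothetical; it only builds strings.
    """
    names = sorted(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        args = ["dvc", "exp", "run"]
        for name, value in zip(names, values):
            args += ["--set-param", f"{name}={value}"]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # Illustrative grid built from the two parameters used in the examples.
    grid = {
        "featurize.max_features": [4000, 5000],
        "featurize.ngrams": [2, 3],
    }
    for command in build_exp_commands(grid):
        print(command)
```

<p>Each printed line can be pasted into a shell. Keeping generation separate from execution makes it easy to review the list before spending compute on it.</p>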
<p>See all the runs:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-pager</span> <span class="token parameter variable">--no-timestamp</span> <span class="token punctuation">\</span>
<span class="token parameter variable">--include-params</span> featurize.max_features,featurize.ngrams</span></code></pre></div>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ─────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>auc<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>featurize.max_features<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>featurize.ngrams<span class="token hide">**</span></span>
</span> ─────────────────────────────────────────────────────────────────────
<span class="token rows"> workspace 0.56359 5000 3
master 0.5674 1500 2
├── exp-80655 0.56359 5000 3
├── exp-63ee0 0.5515 5000 2
├── exp-9bf22 0.56448 4000 2
└── exp-bb55c 0.57462 3000 2
</span> ─────────────────────────────────────────────────────────────────────</code></pre></div>
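<p>To make the comparison in the table explicit, here is a small sketch that ranks the experiments against the <code>master</code> baseline. The AUC values are copied from the table (so they are rounded), and the function name <code>rank_experiments</code> is only an illustration, not a DVC API.</p>

```python
# AUC values copied from the `dvc exp show` table (rounded as displayed).
BASELINE_AUC = 0.5674  # master
EXPERIMENT_AUC = {
    "exp-80655": 0.56359,
    "exp-63ee0": 0.5515,
    "exp-9bf22": 0.56448,
    "exp-bb55c": 0.57462,
}


def rank_experiments(scores, baseline):
    """Return (name, auc, delta_vs_baseline) tuples, best AUC first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, auc, round(auc - baseline, 5)) for name, auc in ranked]


if __name__ == "__main__":
    for name, auc, delta in rank_experiments(EXPERIMENT_AUC, BASELINE_AUC):
        print(f"{name}  auc={auc}  change={delta:+}")
```

<p>With these numbers, <code>exp-bb55c</code> is the only experiment that improves on <code>master</code>, which is why it is the one promoted later in this post.</p>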
<p>Under the hood, DVC uses Git to store the experiments' meta-information. A
straightforward implementation would create visible branches and auto-commit to
them, but that approach would pollute the branch namespace very quickly. To
avoid this issue, we introduced custom Git references <code>exps</code>, the same way
GitHub uses custom references <code>pulls</code> to track pull requests (this is an
interesting technical topic that deserves a separate blog post). Below you can
see how it works.</p>
<p>No artificial branches, only custom references <code>exps</code> (do not worry if you don't
understand this part - it is an implementation detail):</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> branch
</span>* master
<span class="token line"><span class="token input">$ </span><span class="token command">git</span> show-ref
</span>5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_APPLY
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_BRANCH
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/71/67904d89e116f28daf7a6e4c0878268117c893/exp-80655
f16e7b7c804cf52d91d1d11850c15963fb2a8d7b refs/exps/97/d69af70c6fb4bc59aefb9a87437dcd28b3bde4/exp-63ee0
0566d42cddb3a8c4eb533f31027f0febccbbc2dd refs/exps/91/94265d5acd847e1c439dd859aa74b1fc3d73ad/exp-bb55c
9bb067559583990a8c5d499d7435c35a7c9417b7 refs/exps/49/5c835cd36772123e82e812d96eabcce320f7ec/exp-9bf22</code></pre></div>
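<p>If you are curious, the listing above is easy to inspect programmatically. The following sketch parses <code>git show-ref</code> output and maps experiment names to commit hashes. It assumes the five-segment layout visible in the listing (<code>refs/exps/&lt;2 chars&gt;/&lt;rest&gt;/&lt;name&gt;</code>) and skips the <code>refs/exps/exec/</code> entries; since this layout is an implementation detail, it may differ between DVC versions.</p>

```python
# Output copied from the `git show-ref` listing above.
SHOW_REF_OUTPUT = """\
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_APPLY
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_BRANCH
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/71/67904d89e116f28daf7a6e4c0878268117c893/exp-80655
f16e7b7c804cf52d91d1d11850c15963fb2a8d7b refs/exps/97/d69af70c6fb4bc59aefb9a87437dcd28b3bde4/exp-63ee0
0566d42cddb3a8c4eb533f31027f0febccbbc2dd refs/exps/91/94265d5acd847e1c439dd859aa74b1fc3d73ad/exp-bb55c
9bb067559583990a8c5d499d7435c35a7c9417b7 refs/exps/49/5c835cd36772123e82e812d96eabcce320f7ec/exp-9bf22
"""


def parse_experiment_refs(show_ref_output):
    """Map experiment name -> commit SHA for named refs under refs/exps/.

    Assumption: named experiments use exactly five path segments, as in the
    listing; shorter refs such as refs/exps/exec/* are ignored.
    """
    experiments = {}
    for line in show_ref_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # not a "<sha> <ref>" line
        sha, ref = parts
        segments = ref.split("/")
        if segments[:2] != ["refs", "exps"] or len(segments) != 5:
            continue
        experiments[segments[-1]] = sha
    return experiments


if __name__ == "__main__":
    for name, sha in parse_experiment_refs(SHOW_REF_OUTPUT).items():
        print(name, sha[:7])
```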
<p>The best experiment can be promoted to the workspace and committed to Git.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp apply</span> exp-bb55c
</span><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> <span class="token builtin class-name">.</span>
</span><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span> <span class="token parameter variable">-m</span> <span class="token string">'optimize max feature size'</span></span></code></pre></div>
<p>Alternatively, an experiment can be promoted to a branch (<code>big_fr_size</code> branch
in this case):</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp branch</span> exp-80655 big_fr_size
</span>Git branch 'big_fr_size' has been created from experiment 'exp-80655'.
To switch to the new branch run:
git checkout big_fr_size</code></pre></div>
<p>Remove all the experiments that were not used:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp gc</span> <span class="token parameter variable">--workspace</span> <span class="token parameter variable">--force</span></span></code></pre></div>
<h2 id="model-checkpoints" style="position:relative;">Model checkpoints<a href="#model-checkpoints" aria-label="model checkpoints permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>ML model checkpoints are an essential part of deep learning. ML engineers prefer
to save the model files (or weights) at checkpoints during a training process
and return to them when metrics start diverging or learning is not fast enough.</p>
<p>Checkpoints create a different dynamic around the ML modeling process and need
special support from the toolset:</p>
<ol>
<li>Track and save model checkpoints (DVC outputs) periodically, not only the
final result or training epoch.</li>
<li>Save metrics corresponding to each of the checkpoints.</li>
<li>Reuse checkpoints - warm-start training with an existing model file,
corresponding code, dataset version and metrics.</li>
</ol>
<p>This new behavior is supported in DVC 2.0. Now, DVC can version all your
checkpoints with corresponding code and data. It brings the reproducibility of
DL processes to the next level - every checkpoint is reproducible.</p>
<p>This is how you define checkpoints with live-metrics:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc stage add</span> <span class="token parameter variable">-n</span> train <span class="token punctuation">\</span>
<span class="token parameter variable">-d</span> users.csv <span class="token parameter variable">-d</span> train.py <span class="token punctuation">\</span>
<span class="token parameter variable">-p</span> dropout,epochs,lr,process <span class="token punctuation">\</span>
<span class="token parameter variable">--checkpoint</span> model.h5 <span class="token punctuation">\</span>
<span class="token parameter variable">--live</span> logs <span class="token punctuation">\</span>
python train.py
</span>
Creating 'dvc.yaml'
Adding stage 'train' in 'dvc.yaml'</code></pre></div>
<p>Note that we use the <a href="https://dvc.org/doc/command-reference/stage/add"><code>dvc stage add</code></a> command instead of <code>dvc run</code>. Starting with DVC 2.0,
we are moving all stage-specific functionality under the <a href="https://dvc.org/doc/command-reference/stage"><code>dvc stage</code></a> umbrella.
<code>dvc run</code> still works, but it will be deprecated in the next major DVC
version (most likely 3.0).</p>
<p>Start the training process and interrupt it after 5 epochs:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span>
</span>'users.csv.dvc' didn't change, skipping
Running stage 'train':
> python train.py
...
^CTraceback (most recent call last):
...
KeyboardInterrupt</code></pre></div>
<p>Navigate the checkpoints:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-pager</span> <span class="token parameter variable">--no-timestamp</span></span></code></pre></div>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>accuracy<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>epochs<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span>
</span> ──────────────────────────────────────────────────────────────────────
<span class="token rows"> workspace 4 2.0702 0.30388 2.025 … 5 …
master - - - - … 5 …
│ ╓ exp-e15bc 4 2.0702 0.30388 2.025 … 5 …
│ ╟ 5ea8327 4 2.0702 0.30388 2.025 … 5 …
│ ╟ bc0cf02 3 2.1338 0.23988 2.0883 … 5 …
│ ╟ f8cf03f 2 2.1989 0.17932 2.1542 … 5 …
│ ╟ 7575a44 1 2.2694 0.12833 2.223 … 5 …
├─╨ a72c526 0 2.3416 0.0959 2.2955 … 5 …
</span> ──────────────────────────────────────────────────────────────────────</code></pre></div>
<p>Each of the checkpoints above is a separate experiment with all its data, code,
parameters and metrics. You can use the same <a href="https://dvc.org/doc/command-reference/exp/apply"><code>dvc exp apply</code></a> command to extract
any of these.</p>
<p>Another run continues this process. You can see how accuracy metrics are
increasing - DVC does not remove the model/checkpoint and training code trains
on top of it:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span>
</span>Existing checkpoint experiment 'exp-e15bc' will be resumed
...
^C
KeyboardInterrupt
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-pager</span> <span class="token parameter variable">--no-timestamp</span></span></code></pre></div>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>accuracy<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>epochs<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span>
</span> ──────────────────────────────────────────────────────────────────────
<span class="token rows"> workspace 9 1.7845 0.58125 1.7381 … 5 …
master - - - - … 5 …
│ ╓ exp-e15bc 9 1.7845 0.58125 1.7381 … 5 …
│ ╟ 205a8d3 9 1.7845 0.58125 1.7381 … 5 …
│ ╟ dd23d96 8 1.8369 0.54173 1.7919 … 5 …
│ ╟ 5bb3a1f 7 1.8929 0.49108 1.8474 … 5 …
│ ╟ 6dc5610 6 1.951 0.43433 1.9046 … 5 …
│ ╟ a79cf29 5 2.0088 0.36837 1.9637 … 5 …
│ ╟ 5ea8327 4 2.0702 0.30388 2.025 … 5 …
│ ╟ bc0cf02 3 2.1338 0.23988 2.0883 … 5 …
│ ╟ f8cf03f 2 2.1989 0.17932 2.1542 … 5 …
│ ╟ 7575a44 1 2.2694 0.12833 2.223 … 5 …
├─╨ a72c526 0 2.3416 0.0959 2.2955 … 5 …
</span> ──────────────────────────────────────────────────────────────────────</code></pre></div>
<p>After modifying the code, data, or params, the same process can be resumed. DVC
recognizes the change and shows it (see experiment <code>b363267</code>):</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">vi</span> train.py <span class="token comment"># modify code</span>
</span><span class="token line"><span class="token input">$ </span><span class="token command">vi</span> params.yaml <span class="token comment"># modify params</span>
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span>
</span>Modified checkpoint experiment based on 'exp-e15bc' will be created
...
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-pager</span> <span class="token parameter variable">--no-timestamp</span></span></code></pre></div>
<div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>accuracy<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>epochs<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span>
</span> ──────────────────────────────────────────────────────────────────────────────
<span class="token rows"> workspace 13 1.5841 0.69262 1.5381 … 15 …
master - - - - … 5 …
│ ╓ exp-7ff06 13 1.5841 0.69262 1.5381 … 15 …
│ ╟ 6c62fec 12 1.6325 0.67248 1.5857 … 15 …
│ ╟ 4baca3c 11 1.6817 0.64855 1.6349 … 15 …
│ ╟ b363267 (2b06de7) 10 1.7323 0.61925 1.6857 … 15 …
│ ╓ 2b06de7 9 1.7845 0.58125 1.7381 … 5 …
│ ╟ 205a8d3 9 1.7845 0.58125 1.7381 … 5 …
│ ╟ dd23d96 8 1.8369 0.54173 1.7919 … 5 …
│ ╟ 5bb3a1f 7 1.8929 0.49108 1.8474 … 5 …
│ ╟ 6dc5610 6 1.951 0.43433 1.9046 … 5 …
│ ╟ a79cf29 5 2.0088 0.36837 1.9637 … 5 …
│ ╟ 5ea8327 4 2.0702 0.30388 2.025 … 5 …
│ ╟ bc0cf02 3 2.1338 0.23988 2.0883 … 5 …
│ ╟ f8cf03f 2 2.1989 0.17932 2.1542 … 5 …
│ ╟ 7575a44 1 2.2694 0.12833 2.223 … 5 …
├─╨ a72c526 0 2.3416 0.0959 2.2955 … 5 …
</span> ──────────────────────────────────────────────────────────────────────────────</code></pre></div>
<p>Sometimes you might need to train the model from scratch. The reset option
removes the checkpoint file before training: <a href="https://dvc.org/doc/command-reference/exp/run#--reset"><code>dvc exp run --reset</code></a>.</p>
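<p>The resume and reset behavior described in this section relies on the training script loading an existing checkpoint when one is present. Here is a minimal, framework-free sketch of that pattern. It is an assumption about how a <code>train.py</code> could be structured, not the script used in this post: a small JSON file stands in for <code>model.h5</code>, and the <code>reset</code> argument mimics removing the checkpoint before training.</p>

```python
import json
import os
import tempfile


def train(checkpoint_path, epochs, reset=False):
    """Run `epochs` more steps, resuming from `checkpoint_path` if it exists."""
    if reset and os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)  # start from scratch

    state = {"step": -1, "weight": 0.0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # warm start from the saved checkpoint

    for _ in range(epochs):
        state["step"] += 1
        state["weight"] += 0.1  # stand-in for a real training update
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)  # save a checkpoint after every epoch
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.json")
        print(train(path, epochs=5)["step"])              # first run
        print(train(path, epochs=5)["step"])              # resumed run
        print(train(path, epochs=5, reset=True)["step"])  # reset, then train
```

<p>The first two calls end at steps 4 and 9, mirroring the two tables above, and the third starts over and ends at step 4 again.</p>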
<h2 id="metrics-logging" style="position:relative;">Metrics logging<a href="#metrics-logging" aria-label="metrics logging permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Continuously logging ML metrics is a very common practice in the ML world.
Instead of a simple command-line output with the metrics values, many ML
engineers prefer visuals and plots. These plots can be organized in a "database"
of ML experiments to keep track of a project. There are many specialized solutions
for collecting metrics and tracking experiments, such as Sacred, MLflow, Weights
&amp; Biases, neptune.ai, and others.</p>
<p>With DVC 2.0, we are releasing a new open-source library,
<a href="https://github.com/iterative/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVC-Live</a>, that tracks
model metrics and organizes them in simple text files so that
DVC can visualize them and navigate them through Git history. As a result, DVC can show
you the metrics difference between the current model and a model in <code>master</code> or
any other branch.</p>
<p>This approach is similar to other metrics tracking tools, with the difference
that Git becomes the "database" of ML experiments.</p>
<h3 id="generate-metrics-file" style="position:relative;">Generate metrics file<a href="#generate-metrics-file" aria-label="generate metrics file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Install the library:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> dvclive</span></code></pre></div>
<p>Instrument your code:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> dvclive
<span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>keras <span class="token keyword">import</span> DvcLiveCallback
dvclive<span class="token punctuation">.</span>init<span class="token punctuation">(</span><span class="token string">"logs"</span><span class="token punctuation">)</span> <span class="token comment">#, summarize=True)</span>
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
model<span class="token punctuation">.</span>fit<span class="token punctuation">(</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
<span class="token comment"># Set up DVC-Live callback:</span>
callbacks<span class="token operator">=</span><span class="token punctuation">[</span> DvcLiveCallback<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">]</span>
<span class="token punctuation">)</span>
</code></pre></div>
<p>During training, you will see the metrics files being continuously
populated after each epoch:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">ls</span> logs/
</span>accuracy.tsv loss.tsv val_accuracy.tsv val_loss.tsv
<span class="token line"><span class="token input">$ </span><span class="token command">head</span> logs/accuracy.tsv
</span>timestamp step accuracy
1613645582716 0 0.7360000014305115
1613645585478 1 0.8349999785423279
1613645587322 2 0.8830000162124634
1613645589125 3 0.9049999713897705
1613645590891 4 0.9070000052452087
1613645592681 5 0.9279999732971191
1613645594490 6 0.9430000185966492
1613645596232 7 0.9369999766349792
1613645598034 8 0.9430000185966492</code></pre></div>
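<p>Because these logs are plain text, they can be read with standard tools. The sketch below parses the <code>accuracy.tsv</code> content shown above and finds the step with the highest accuracy. It assumes the columns are tab-separated (suggested by the <code>.tsv</code> extension); the helper names <code>read_metric</code> and <code>best_step</code> are illustrative and not part of DVC-Live.</p>

```python
import csv
import io

# Rows copied from `head logs/accuracy.tsv`; tab separators are an assumption.
ACCURACY_TSV = (
    "timestamp\tstep\taccuracy\n"
    "1613645582716\t0\t0.7360000014305115\n"
    "1613645585478\t1\t0.8349999785423279\n"
    "1613645587322\t2\t0.8830000162124634\n"
    "1613645589125\t3\t0.9049999713897705\n"
    "1613645590891\t4\t0.9070000052452087\n"
    "1613645592681\t5\t0.9279999732971191\n"
    "1613645594490\t6\t0.9430000185966492\n"
    "1613645596232\t7\t0.9369999766349792\n"
    "1613645598034\t8\t0.9430000185966492\n"
)


def read_metric(tsv_text, metric):
    """Return a list of (step, value) pairs for the given metric column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(int(row["step"]), float(row[metric])) for row in reader]


def best_step(points):
    """Return the (step, value) pair with the highest value (first on ties)."""
    return max(points, key=lambda point: point[1])


if __name__ == "__main__":
    points = read_metric(ACCURACY_TSV, "accuracy")
    step, value = best_step(points)
    print(f"best accuracy {value:.4f} at step {step}")
```

<p>To read a real log, pass the file contents instead of the sample string, for example <code>read_metric(open("logs/accuracy.tsv").read(), "accuracy")</code>.</p>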
<p>In addition to the continuous metrics files, you will see the summary metrics
file and HTML file with the same file prefix. The summary file contains the
result of the latest epoch:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> logs.json <span class="token operator">|</span> python <span class="token parameter variable">-m</span> json.tool
</span>{
"step": 41,
"loss": 0.015958430245518684,
"accuracy": 0.9950000047683716,
"val_loss": 13.705962181091309,
"val_accuracy": 0.5149999856948853
}</code></pre></div>
<p>The HTML file contains all the visuals for continuous metrics as well as the
summary metrics on a single page:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b66f0f1e2076cdf2661acb4f621e7255/39600/dvclive-html.png" alt="dvclive html" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Note that the HTML and summary metrics files are generated automatically for
each epoch, so you can monitor model performance in real time.</p>
<h3 id="git-navigation-with-the-metrics-file" style="position:relative;">Git-navigation with the metrics file<a href="#git-navigation-with-the-metrics-file" aria-label="git navigation with the metrics file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>A DVC repository is NOT required to use the live metrics functionality described
above. It works independently of DVC.</p>
<p>A DVC repository becomes useful when the metrics and plots are committed to your
Git repository and you need to navigate the metrics across commits.</p>
<p>Metrics difference between workspace and the last Git commit:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git status</span> <span class="token parameter variable">-s</span>
</span> M logs.json
M logs/accuracy.tsv
M logs/loss.tsv
M logs/val_accuracy.tsv
M logs/val_loss.tsv
M train.py
?? model.h5
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc metrics diff</span> <span class="token parameter variable">--target</span> logs.json
</span>Path Metric Old New Change
logs.json accuracy 0.995 0.99 -0.005
logs.json loss 0.01596 0.03036 0.0144
logs.json step 41 36 -5
logs.json val_accuracy 0.515 0.5175 0.0025
logs.json val_loss 13.70596 3.29033 -10.41563</code></pre></div>
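<p>The <code>Change</code> column is simply the new value minus the old one. As a rough sketch of that arithmetic (not DVC's actual implementation), the function below compares two summary dictionaries shaped like <code>logs.json</code>. The values are copied from the table above, and rounding to five decimals is an assumption made to match the displayed precision.</p>

```python
# Values copied from the `dvc metrics diff` table (Old = last commit,
# New = workspace).
OLD = {"accuracy": 0.995, "loss": 0.01596, "step": 41,
       "val_accuracy": 0.515, "val_loss": 13.70596}
NEW = {"accuracy": 0.99, "loss": 0.03036, "step": 36,
       "val_accuracy": 0.5175, "val_loss": 3.29033}


def metrics_diff(old, new, ndigits=5):
    """Return {metric: (old, new, change)} for metrics present in both dicts."""
    return {
        name: (old[name], new[name], round(new[name] - old[name], ndigits))
        for name in sorted(old.keys() & new.keys())
    }


if __name__ == "__main__":
    print(f"{'Metric':<14}{'Old':>10}{'New':>10}{'Change':>11}")
    for name, (o, n, change) in metrics_diff(OLD, NEW).items():
        print(f"{name:<14}{o:>10}{n:>10}{change:>11}")
```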
<p>The difference between a particular commit/branch/tag or between two commits:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc metrics diff</span> <span class="token parameter variable">--target</span> logs.json HEAD^ 47b85c
</span>Path Metric Old New Change
logs.json accuracy 0.995 0.998 0.003
logs.json loss 0.01596 0.01951 0.00355
logs.json step 41 82 41
logs.json val_accuracy 0.515 0.51 -0.005
logs.json val_loss 13.70596 5.83056 -7.8754</code></pre></div>
<p>The same Git-navigation works with the plots:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> <span class="token parameter variable">--target</span> logs
</span>file:///Users/dmitry/src/exp-dc/plots.html</code></pre></div>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/cdc4ec4dabed1d7de6b8606667ebfc83/39600/dvclive-diff-html.png" alt="dvclive diff html" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Another nice thing about the live metrics is that they work across ML experiments and
checkpoints if properly set up in DVC stages. To set up live metrics, you need
to specify the metrics directory in the <code>live</code> section of a stage:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
<span class="token key atrule">train</span><span class="token punctuation">:</span>
<span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py
<span class="token key atrule">live</span><span class="token punctuation">:</span>
<span class="token key atrule">logs</span><span class="token punctuation">:</span>
<span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span>
<span class="token key atrule">summary</span><span class="token punctuation">:</span> <span class="token boolean important">true</span>
<span class="token key atrule">report</span><span class="token punctuation">:</span> <span class="token boolean important">true</span>
<span class="token key atrule">deps</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> data</code></pre></div>
<h2 id="thank-you" style="position:relative;">Thank you!<a href="#thank-you" aria-label="thank you permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>I'd like to thank all of you DVC community members for the feedback that we are
constantly getting. This feedback helps us build new functionalities in DVC and
make it more stable.</p>
<p>Please be in touch with us on <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> and our
<a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>.</p>https://dvc.org/blog/february-21-dvc-heartbeathttps://dvc.org/blog/february-21-dvc-heartbeatTue, 16 Feb 2021 00:00:00 GMT<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Happy February! Here's all the news to keep you up to date.</p>
<h2 id="weve-hired-and-are-still-hiring" style="position:relative;">We've hired and are still hiring!<a href="#weve-hired-and-are-still-hiring" aria-label="weve hired and are still hiring permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We have four new team members this month!</p>
<p><a href="https://www.linkedin.com/in/david-berenbaum-20b6b424/" target="_blank" rel="nofollow noopener noreferrer"><strong>Dave Berenbaum</strong></a> came
to Iterative.ai by way of a
<a href="https://github.com/iterative/dvc/pull/2107" target="_blank" rel="nofollow noopener noreferrer">previous contribution</a> to our open
source products while working as a Data Science Manager at Capital One. He joins
the team as a Technical Product Manager. We are thrilled he's here!</p>
<p><a href="https://www.linkedin.com/in/batuhan-osman-taskaya-7803b61a0/" target="_blank" rel="nofollow noopener noreferrer"><strong>Batuhan Taskaya</strong></a>
joins us as a DVC Software Engineer working on the Python core. Batuhan is
excited to work on open source full time and we are excited to have him do so!</p>
<p><a href="https://www.linkedin.com/in/jenifer-de-figueiredo/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jeny De Figueiredo</strong></a> is
involved in the Seattle area data science community at Data Circles and is a
WiDS Puget Sound Ambassador. She joins us as our new Community Manager and is
looking forward to further building and engaging the community in MLOps! (Hi!
This is me. 🙋🏻♀️ I'll be writing Heartbeat!)</p>
<p><a href="https://www.linkedin.com/in/rogermparent/" target="_blank" rel="nofollow noopener noreferrer"><strong>Roger Parent</strong></a> has already been a
big part of building DVC and <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML</a>. He has been a primary
developer of a UI that interfaces with the DVC Python application to provide an
interface with the Experiments feature that's coming out with DVC 2.0. We are so
excited to have him joining us full time as a Software Engineer.</p>
<p><img src="https://media.giphy.com/media/vAvWgk3NCFXTa/giphy.gif" alt="Search"></p>
<h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We are on the hunt for a
<a href="https://docs.google.com/document/d/1aT5HZYt4kAUxXqD4JNTe3jPDlVUwSmnEWDPR2QoKdvo/edit" target="_blank" rel="nofollow noopener noreferrer">TypeScript Front-End Engineer</a>
to build SaaS and a VS Code UI for our popular machine learning tools: DVC and
CML. The ML tools ecosystem is where the JS space was 10 years ago. Come join us on
this exciting project!</p>
<p>Our search continues for a
<a href="https://weworkremotely.com/remote-jobs/iterative-developer-advocate" target="_blank" rel="nofollow noopener noreferrer">Developer Advocate</a>
to support and inspire developers by creating new content like blogs, tutorials,
and videos - plus lead outreach through meetups and conferences.</p>
<p>Does this sound like you or someone you know? Be in touch!</p>
<h2 id="iterativeai-featured-on-the-new-stack" style="position:relative;">Iterative.ai Featured on The New Stack<a href="#iterativeai-featured-on-the-new-stack" aria-label="iterativeai featured on the new stack permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://thenewstack.io/author/susanhall/" target="_blank" rel="nofollow noopener noreferrer">Susan Hall</a> of
<a href="https://thenewstack.io/" target="_blank" rel="nofollow noopener noreferrer">The New Stack.io</a> interviewed our very own CEO,
<a href="https://twitter.com/fullstackml" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a>, discussing the needs of ML
engineers and how Iterative.ai makes tools to enable version control and CI/CD
for versioning data and ML models.</p>
<blockquote>
<p>"ML engineers, they still need collaboration. They need GitHub for
collaboration, they need this CI/CD system to resolve [issues] between each
other, between the team and productions system." - Dmitry Petrov</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://thenewstack.io/iterative-ai-git-based-machine-learning-tools-for-data-engineers/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Learning Tools for ML Engineers</h4>
<div class="elp-description">Susan Hall</div>
<div class="elp-link">thenewstack.io</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-02-16/newstack_image-b2d3ce71adb8e6bfee248da2677d5804.png" alt="Learning Tools for ML Engineers">
</div>
</a>
</section>
<p></p>
<h2 id="workshops-and-talks" style="position:relative;">Workshops and Talks<a href="#workshops-and-talks" aria-label="workshops and talks permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="developer-advocacy-for-data-science" style="position:relative;">Developer Advocacy for Data Science<a href="#developer-advocacy-for-data-science" aria-label="developer advocacy for data science permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>So you saw the post further up. 👆🏽 Curious about developer advocacy or what to
look for in a hire for this position?
<a href="https://twitter.com/drelleobrien" target="_blank" rel="nofollow noopener noreferrer">Elle O'Brien</a> dove into this recently with
<a href="https://twitter.com/Al_Grigor" target="_blank" rel="nofollow noopener noreferrer">Alexey Grigorev</a> (author of a
<a href="https://mlbookcamp.com/" target="_blank" rel="nofollow noopener noreferrer">Data Science Bookcamp</a>)
<a href="https://www.youtube.com/watch?v=jv5W4jXk4P4" target="_blank" rel="nofollow noopener noreferrer">in this podcast</a> on
<a href="http://datatalks.club/" target="_blank" rel="nofollow noopener noreferrer">DataTalks.club</a>. You can watch it below. 👇🏼</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/jv5W4jXk4P4?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>As ever, we have much to share from the great citizens of the DVC community.</p>
<h3 id="spacy-and-dvc-integration" style="position:relative;">spaCy and DVC Integration<a href="#spacy-and-dvc-integration" aria-label="spacy and dvc integration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If your NLP team uses spaCy to manage your projects, with spaCy's release of
v3.0, you can now enjoy DVC integration to manage your workflow like Git! Check
out the <a href="https://spacy.io/usage/projects#integrations" target="_blank" rel="nofollow noopener noreferrer">documentation here</a> to
streamline and track your process! 🏆</p>
<p>
</p><section class="elp-content-holder">
<a href="https://spacy.io/usage/projects#integrations/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">spaCy Integration</h4>
<div class="elp-description">spaCy Integration with DVC</div>
<div class="elp-link">spacy.io</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-02-16/spacy_integration-5ed0b2ce56d8ed2cad219e7df076dce1.jpg" alt="spaCy Integration">
</div>
</a>
</section>
<p></p>
<h3 id="dagshub-and-dvc-integrations" style="position:relative;">DagsHub and DVC Integrations<a href="#dagshub-and-dvc-integrations" aria-label="dagshub and dvc integrations permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This month two great articles came out regarding the integration of DAGsHub and
DVC. First,
<a href="https://dagshub.com/blog/datasets-should-behave-like-git-repositories/" target="_blank" rel="nofollow noopener noreferrer">Datasets Should Behave Like Git Repositories</a>
walks you through the steps to use DVC in your data versioning. The following image shows
the dependencies and how you simply need to do a <a href="https://dvc.org/doc/command-reference/update"><code>dvc update</code></a> each time your
dataset or model changes to track the process.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://dagshub.com/blog/datasets-should-behave-like-git-repositories/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Datasets Should Behave Like Git Repositories</h4>
<div class="elp-description">Steps to use DVC in your data versioning</div>
<div class="elp-link">dagshub.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-02-16/dagshub-logo-d90d994c91caee62972094d181d37c0f.png" alt="Datasets Should Behave Like Git Repositories">
</div>
</a>
</section>
<p></p>
<h3 id="did-you-say-works-out-of-the-box" style="position:relative;">Did you say "Works Out of the Box?"<a href="#did-you-say-works-out-of-the-box" aria-label="did you say works out of the box permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Also from DAGsHub, by CEO <a href="https://twitter.com/DeanPlbn" target="_blank" rel="nofollow noopener noreferrer">Dean Pleban</a>,
<a href="https://dagshub.com/blog/dagshub-storage-zero-configuration-dataset-model-hosting/" target="_blank" rel="nofollow noopener noreferrer">Free Dataset & Model Hosting with Zero Configuration - Launching DAGsHub Storage</a>
tells how their new DAGsHub storage is a DVC remote that requires zero
configuration (!) and will allow for team and organization access controls as
well as easy visibility.</p>
<p><img src="https://media.giphy.com/media/Ftz07proVX6Rq/giphy.gif" alt="Friends"></p>
<h3 id="model-management-and-ml-workflow-orchestration-with-dvc-and-apache-airflow--️" style="position:relative;">Model Management and ML Workflow Orchestration with DVC and Apache Airflow 🇩🇪 ❗️<a href="#model-management-and-ml-workflow-orchestration-with-dvc-and-apache-airflow--%EF%B8%8F" aria-label="model management and ml workflow orchestration with dvc and apache airflow ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We're really excited about a German language workshop led by
<a href="https://twitter.com/matthiasniehoff" target="_blank" rel="nofollow noopener noreferrer">Matthias Niehoff</a>! The workshop will be a
part of the ML Summit 2021 taking place April 19-21st, but registration closes
February 18th. So time is ticking. ⏰ The Conference is online, but will be in
German. For more info, head here 👉🏽 for the
<a href="https://ml-summit.de/machine-learing/modellmanagement-und-ml-workflow-orchestrierung-mit-dvc-und-apache-airflow/" target="_blank" rel="nofollow noopener noreferrer">Workshop Details</a>.</p>
<h3 id="the-most-popular-n1-tool-used-by-teams-on-spell" style="position:relative;">"<em>The</em> most popular 'N+1' tool used by teams on Spell"<a href="#the-most-popular-n1-tool-used-by-teams-on-spell" aria-label="the most popular n1 tool used by teams on spell permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://spell.ml/blog/using-dvc-with-spell-YBHOChEAACgAaSmV" target="_blank" rel="nofollow noopener noreferrer">Using DVC as a Lightweight Feature Store on Spell</a>
by <a href="https://twitter.com/ResidentMario" target="_blank" rel="nofollow noopener noreferrer">Aleksey Bilogur</a>, reviews the process of
using DVC with Spell for managing changing datasets, enabling team-wide data
reproducibility and why Spell fans are DVC fans, and vice versa. 🔄</p>
<p><img src="https://media.giphy.com/media/GM8PrUsm92hRC/giphy.gif" alt="Fans"></p>
<h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">How do you deploy a machine learning model?<br><br>Check out my new post! <a href="https://t.co/Qx3RtQ7hO0">https://t.co/Qx3RtQ7hO0</a><br><br>In it we build:<br><br>🚀REST service with <a href="https://twitter.com/tiangolo">@tiangolo</a>'s sleek FastAPI<br>🌐Chrome extension to interact with the model<br>🐳Custom <a href="https://twitter.com/Docker">@Docker</a> images<br>🏇CI/CD with <a href="https://twitter.com/DVCorg">@DVCorg</a> + Github actions</p>— Mihail Eric (@mihail_eric) <a href="https://twitter.com/mihail_eric/status/1357014486377324547">February 3, 2021</a></blockquote>
<p>You're all caught up! See you at the next Community Gems 💎!</p>
<hr>
<p><em>Do you have any use case questions or need support? Join us in
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>
<p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and
best practices.</em></p>https://dvc.org/blog/january-21-community-gemshttps://dvc.org/blog/january-21-community-gemsTue, 26 Jan 2021 00:00:00 GMT<h2 id="dvc-questions" style="position:relative;">DVC questions<a href="#dvc-questions" aria-label="dvc questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="q-is-there-an-equivalent-of-git-restore-file-for-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/799598181310267392" target="_blank" rel="nofollow noopener noreferrer">Q: Is there an equivalent of <code>git restore <file></code> for DVC?</a><a href="#q-is-there-an-equivalent-of-git-restore-file-for-dvc" aria-label="q is there an equivalent of git restore file for dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes! You'll want <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a>. It restores the corresponding version of your
DVC-tracked file or directory from
<a href="https://dvc.org/doc/user-guide/dvc-internals#structure-of-the-cache-directory" target="_blank" rel="nofollow noopener noreferrer">the cache</a>
to your local workspace.
<a href="https://dvc.org/doc/command-reference/checkout#checkout" target="_blank" rel="nofollow noopener noreferrer">Read up in our docs for more info!</a></p>
<h3 id="q-my-dataset-is-made-of-more-than-a-million-small-files-can-i-use-an-archive-format-like-targz-with-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/798983422965841920" target="_blank" rel="nofollow noopener noreferrer">Q: My dataset is made of more than <em>a million</em> small files. Can I use an archive format, like <code>tar.gz</code> with DVC?</a><a href="#q-my-dataset-is-made-of-more-than-a-million-small-files-can-i-use-an-archive-format-like-targz-with-dvc" aria-label="q my dataset is made of more than a million small files can i use an archive format like targz with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There are some downsides to using archive formats, and often we discourage it,
but let's review some factors to consider, so you can make the best choice for
your project.</p>
<ul>
<li>If your <code>tar.gz</code> file changes at all (perhaps because you changed a single
file before zipping), you'll end up with an entirely new copy of the archive
every time you commit! This is not very space efficient, but if space isn't an
issue it might not be a dealbreaker.</li>
<li>Because of the way we optimize data transfer, you'll end up transferring the
whole archive anytime you modify a single file and <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>/<a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a>.</li>
<li>In general, archives don't play nice with the concept of diffs. Looking back
at your git history, it can be challenging to log how files were deleted,
modified, or added when you're versioning archives.</li>
</ul>
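<p>The space cost in the list above can be illustrated with a small,
self-contained sketch. It is not DVC code: the file names and contents are made
up, and MD5 is used only as a stand-in for a content address. It hashes three
files individually and as a single <code>tar.gz</code>, then edits one file and
checks which hashes change.</p>

```python
import gzip
import hashlib
import io
import tarfile
from typing import Dict


def md5(data: bytes) -> str:
    """Hex digest used here as a stand-in for a content address."""
    return hashlib.md5(data).hexdigest()


def make_archive(files: Dict[str, bytes]) -> bytes:
    """Pack files into an in-memory tar.gz with fixed timestamps."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):
            info = tarfile.TarInfo(name=name)  # mtime defaults to 0
            info.size = len(files[name])
            tar.addfile(info, io.BytesIO(files[name]))
    # mtime=0 keeps the gzip header identical between runs
    return gzip.compress(buf.getvalue(), mtime=0)


# Hypothetical dataset: three small files, then one of them is edited
v1 = {"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b"gamma"}
v2 = dict(v1)
v2["b.txt"] = b"beta, edited"

# Tracked file by file, only the edited file gets a new hash
changed_files = [name for name in sorted(v1) if md5(v1[name]) != md5(v2[name])]

# Tracked as one archive, the single object gets a new hash
archive_changed = md5(make_archive(v1)) != md5(make_archive(v2))

print(changed_files, archive_changed)
```

<p>With per-file tracking, only the edited file needs to be stored and
transferred again. With the archive, the whole object counts as new.</p>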
<p>While we can't do much about the general issues that archives present for
version control systems, DVC does have some options that might help you achieve
better data transfer speeds. We recommend exploring DVC's built-in parallelism:
data transfer functions like <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> and <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> have a flag (<code>-j</code>) for
increasing the number of jobs run simultaneously.
<a href="https://dvc.org/doc/command-reference/push#options" target="_blank" rel="nofollow noopener noreferrer">Check out the docs for more details</a>.</p>
<p>In summary, the advantage of using an archive format will depend on both how
often you modify your dataset and how often you need to push and pull data. You
might consider exploring both approaches (with and without compression) and run
some speed tests for your use case. We'd love to know what you find!</p>
<h3 id="q-my-dvc-remote-is-a-server-with-a-self-signed-certificate-when-i-push-data-dvc-is-giving-me-an-ssl-verification-error--how-can-i-get-around-this" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/800707271502856222" target="_blank" rel="nofollow noopener noreferrer">Q: My DVC remote is a server with a self-signed certificate. When I push data, DVC is giving me an SSL verification error- how can I get around this?</a><a href="#q-my-dvc-remote-is-a-server-with-a-self-signed-certificate-when-i-push-data-dvc-is-giving-me-an-ssl-verification-error--how-can-i-get-around-this" aria-label="q my dvc remote is a server with a self signed certificate when i push data dvc is giving me an ssl verification error how can i get around this permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>On S3 or S3-compatible storage, you can configure your AWS CLI to use a custom
certificate path.
<a href="https://docs.aws.amazon.com/credref/latest/refdocs/setting-global-ca_bundle.html" target="_blank" rel="nofollow noopener noreferrer">As suggested by their docs</a>,
you can also set the environment variable <code>AWS_CA_BUNDLE</code> to your <code>.pem</code> file.</p>
<p>Similarly, on HTTP and WebDAV remotes, there's the <code>REQUESTS_CA_BUNDLE</code> environment
variable, which you can set to the path of your self-signed certificate file.</p>
<p>Then, when DVC tries to access your storage, you should be able to get past SSL
verification!</p>
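<p>If you launch DVC from a Python script, one way to apply both variables is
to build the environment explicitly. This is a minimal sketch: the helper name
<code>env_with_ca_bundle</code> and the certificate path are hypothetical, and
only the two variable names come from the answer above.</p>

```python
import os
from typing import Dict, Optional


def env_with_ca_bundle(pem_path: str, base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with both CA bundle variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["AWS_CA_BUNDLE"] = pem_path        # S3 and S3-compatible remotes
    env["REQUESTS_CA_BUNDLE"] = pem_path   # HTTP and WebDAV remotes
    return env


# Hypothetical certificate location; replace with your own .pem file
env = env_with_ca_bundle("certs/my-server.pem")

# The environment could then be passed to a DVC call, for example:
# subprocess.run(["dvc", "push"], env=env, check=True)
```

<p>Copying the environment instead of modifying <code>os.environ</code> keeps
the change limited to the commands you pass it to.</p>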
<h3 id="q-i-want-to-be-able-to-make-my-own-plots-in-python-with-data-points-from-my-dvc-plots-including-older-versions-of-those-plots-what-do-you-recommend-to-get-the-raw-historical-data" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/799617584336338954" target="_blank" rel="nofollow noopener noreferrer">Q: I want to be able to make my own plots in Python with data points from my <code>dvc plots</code>, including older versions of those plots. What do you recommend to get the raw historical data?</a><a href="#q-i-want-to-be-able-to-make-my-own-plots-in-python-with-data-points-from-my-dvc-plots-including-older-versions-of-those-plots-what-do-you-recommend-to-get-the-raw-historical-data" aria-label="q i want to be able to make my own plots in python with data points from my dvc plots including older versions of those plots what do you recommend to get the raw historical data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We suggest</p>
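<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>repo <span class="token keyword">import</span> Repo
revs <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token string">"HEAD~1"</span><span class="token punctuation">,</span> <span class="token string">"HEAD"</span><span class="token punctuation">]</span>  <span class="token comment"># example Git revisions to compare</span>
plots <span class="token operator">=</span> Repo<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>plots<span class="token punctuation">.</span>collect<span class="token punctuation">(</span>revs<span class="token operator">=</span>revs<span class="token punctuation">)</span></code></pre></div>
<p>Then you can plot the data contained in <code>plots</code> to your heart's content!</p>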
<h3 id="q-is-it-safe-to-share-a-dvc-remote-between-two-projects-or-registries" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/799216349405904896" target="_blank" rel="nofollow noopener noreferrer">Q: Is it safe to share a DVC remote between two projects or registries?</a><a href="#q-is-it-safe-to-share-a-dvc-remote-between-two-projects-or-registries" aria-label="q is it safe to share a dvc remote between two projects or registries permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You can share a remote with as many projects as you like. Because DVC uses
content-addressable storage, you'll still get benefits like file deduplication
over every project that uses the remote. This can be useful if you're likely to
have many shared files across projects.</p>
<p>One big thing to watch out for: you have to be very careful with clearing the
DVC cache. When running <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a>, use the <code>--projects</code> flag so you don't remove files
associated with another project.
<a href="https://dvc.org/doc/command-reference/gc#options" target="_blank" rel="nofollow noopener noreferrer">Read up in the docs!</a></p>
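<p>Both points, deduplication and the garbage-collection hazard, can be seen in
a toy content-addressed store. This is a sketch under stated assumptions, not
DVC's implementation: the <code>ToyRemote</code> class, its methods, and the
sample data are all hypothetical.</p>

```python
import hashlib
from typing import Dict, Iterable


class ToyRemote:
    """Toy content-addressed store: objects are keyed by their hash."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def push(self, data: bytes) -> str:
        key = hashlib.md5(data).hexdigest()
        self.objects[key] = data  # identical content maps to the same key
        return key

    def gc(self, keep: Iterable[str]) -> None:
        """Drop every object whose key is not in `keep`."""
        keep = set(keep)
        self.objects = {k: v for k, v in self.objects.items() if k in keep}


remote = ToyRemote()
project_a = {remote.push(b"shared dataset"), remote.push(b"model A")}
project_b = {remote.push(b"shared dataset"), remote.push(b"model B")}

# Four pushes, but the shared dataset is stored only once
stored_after_push = len(remote.objects)

# Collecting garbage with both projects' references keeps everything
remote.gc(keep=project_a | project_b)
stored_after_safe_gc = len(remote.objects)

# Collecting with only project A's references removes project B's model
remote.gc(keep=project_a)
missing_for_b = project_b - set(remote.objects)

print(stored_after_push, stored_after_safe_gc, len(missing_for_b))
```

<p>The last step is the situation to avoid: the collector only knew about one
project, so it discarded an object the other project still needed.</p>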
<h3 id="q-can-i-throttle-the-number-of-simultaneous-uploads-to-remote-storage-with-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/802099863076208662" target="_blank" rel="nofollow noopener noreferrer">Q: Can I throttle the number of simultaneous uploads to remote storage with DVC?</a><a href="#q-can-i-throttle-the-number-of-simultaneous-uploads-to-remote-storage-with-dvc" aria-label="q can i throttle the number of simultaneous uploads to remote storage with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yep! That'll be the <code>-j/--jobs</code> flag, for example:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span> <span class="token parameter variable">-j</span> <span class="token operator"><</span>number<span class="token operator">></span></span></code></pre></div>
<p>will control the number of simultaneous uploads DVC attempts when pushing files
to your remote storage
(<a href="https://dvc.org/doc/command-reference/push#push" target="_blank" rel="nofollow noopener noreferrer">see more in our docs</a>).</p>
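<p>To make the idea of capping simultaneous uploads concrete, here is a generic
sketch using a thread pool. It does not reflect how DVC implements
<code>--jobs</code>; the <code>upload_all</code> function and the fake upload
are assumptions for illustration only.</p>

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def upload_all(files: List[str], jobs: int) -> int:
    """Pretend to upload files with at most `jobs` transfers at once.

    Returns the highest number of simultaneous uploads observed.
    """
    lock = threading.Lock()
    active = 0
    peak = 0

    def fake_upload(name: str) -> None:
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # stand-in for the network transfer
        with lock:
            active -= 1

    # max_workers plays the role of the job limit
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        list(pool.map(fake_upload, files))
    return peak


peak = upload_all([f"file_{i}" for i in range(20)], jobs=4)
print(peak)
```

<p>However many files are queued, the observed peak never exceeds the job
limit, which is the behavior you rely on when throttling uploads.</p>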
<h2 id="cml-questions" style="position:relative;">CML questions<a href="#cml-questions" aria-label="cml questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="q-i-have-a-dvc-pipeline-that-i-want-to-run-in-cicd-specifically-i-only-want-to-reproduce-the-stages-that-have-changed-since-my-last-commit-what-do-i-do" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/796185815574511616" target="_blank" rel="nofollow noopener noreferrer">Q: I have a DVC pipeline that I want to run in CI/CD. Specifically, I only want to reproduce the stages that have changed since my last commit. What do I do?</a><a href="#q-i-have-a-dvc-pipeline-that-i-want-to-run-in-cicd-specifically-i-only-want-to-reproduce-the-stages-that-have-changed-since-my-last-commit-what-do-i-do" aria-label="q i have a dvc pipeline that i want to run in cicd specifically i only want to reproduce the stages that have changed since my last commit what do i do permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC pipelines, like makefiles, will only reproduce stages that DVC detects have
changed since the last commit. So to do this in CI/CD systems like GitHub
Actions or GitLab CI, you'll want to make sure the workflow a) syncs the runner
with the latest version of your pipeline, including all inputs and dependencies,
and b) reruns your DVC pipeline.</p>
<p>In practice, your workflow needs to include these two commands:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span></span></code></pre></div>
<p>You pull the latest version of your pipeline, inputs and dependencies from cloud
storage with <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a>, and then <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> intelligently reproduces the
pipeline (meaning, it should avoid rerunning stages that haven't changed since
the last commit).</p>
<p>Check out an
<a href="https://github.com/iterative/cml_dvc_case/blob/master/.github/workflows/cml.yaml" target="_blank" rel="nofollow noopener noreferrer">example workflow here</a>.</p>
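<p>The Makefile-like behavior can be sketched as a comparison between the
current dependency hashes and the hashes recorded at the last run. This is a
simplified illustration, not DVC's logic: the stage names, file contents, and
the <code>changed_stages</code> helper are hypothetical, and the sketch ignores
how a rerun stage can invalidate downstream stages.</p>

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def changed_stages(
    stages: Dict[str, List[str]],
    files: Dict[str, bytes],
    lock: Dict[str, Dict[str, str]],
) -> List[str]:
    """Return stages whose dependency hashes differ from the recorded ones."""
    changed = []
    for name, deps in stages.items():
        current = {dep: digest(files[dep]) for dep in deps}
        if lock.get(name) != current:
            changed.append(name)
    return changed


# Hypothetical workspace and pipeline definition
files = {
    "data.csv": b"1,2,3",
    "prepare.py": b"print('prepare')",
    "train.py": b"print('train')",
}
stages = {"prepare": ["data.csv", "prepare.py"], "train": ["train.py"]}

# Hashes recorded at the last run (the role a lock file plays)
lock = {name: {d: digest(files[d]) for d in deps} for name, deps in stages.items()}

# Only the training script is edited afterwards
files["train.py"] = b"print('train v2')"

to_run = changed_stages(stages, files, lock)
print(to_run)
```

<p>This is also why the runner must be synced first: without the latest inputs
and recorded state, every stage would look changed.</p>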
<h3 id="q-im-using-dvc-and-cml-to-pull-data-from-cloud-storage-then-train-a-model-i-want-to-push-the-trained-model-into-cloud-storage-when-im-done-what-should-i-do" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/801553810618187796" target="_blank" rel="nofollow noopener noreferrer">Q: I'm using DVC and CML to pull data from cloud storage, then train a model. I want to push the trained model into cloud storage when I'm done, what should I do?</a><a href="#q-im-using-dvc-and-cml-to-pull-data-from-cloud-storage-then-train-a-model-i-want-to-push-the-trained-model-into-cloud-storage-when-im-done-what-should-i-do" aria-label="q im using dvc and cml to pull data from cloud storage then train a model i want to push the trained model into cloud storage when im done what should i do permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>One approach is to add</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc add</span> <span class="token operator"><</span>model<span class="token operator">></span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span> <span class="token operator"><</span>model<span class="token operator">></span></span></code></pre></div>
<p>to the end of your workflow. This will push the model file, but there's a
downside: it won't keep a strong link between the pipeline (meaning, the command
you used to generate the model and any code/data dependencies) and the model
file.</p>
<p>What we recommend is that you create a
<a href="https://dvc.org/doc/start/data-pipelines#get-started-data-pipelines" target="_blank" rel="nofollow noopener noreferrer">DVC pipeline</a>
with one stage (training your model) that declares your model file as an output.
Then, your workflow can look like this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token comment"># get data</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> <span class="token parameter variable">--run-cache</span>
</span>
<span class="token comment"># run the pipeline</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span>
</span>
<span class="token comment"># push to remote storage</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span> <span class="token parameter variable">--run-cache</span></span></code></pre></div>
<p>When you run this workflow with the <code>--run-cache</code> flags, you'll be able to save
all the results of the pipeline in the cloud
(<a href="https://dvc.org/doc/command-reference/push#options" target="_blank" rel="nofollow noopener noreferrer">read more here</a>). When the
run has completed, you can go to your local workspace and run:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> <span class="token parameter variable">--run-cache</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span></span></code></pre></div>
<p>This will put your model in your local workspace! And, you get an immutable link
between the code version, data version and model you end up with.</p>
<p>We recommend this approach so you don't lose track of how model files relate to
the data and code that produced them. It's a little more work to set up, but
Future You will thank you!</p>
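<p>As a rough illustration of the one-stage pipeline described above, the snippet
below writes a minimal <code>dvc.yaml</code>. The stage name <code>train</code>, the script
<code>train.py</code>, the <code>data</code> directory and the output <code>model.pkl</code> are
placeholders chosen for this sketch, not names from the original question:</p>
<div class="gatsby-highlight" data-language="shell"><pre class="language-shell"><code class="language-shell"># Sketch only: all file and stage names are hypothetical.
cat > dvc.yaml <<'EOF'
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      - data
    outs:
      - model.pkl
EOF</code></pre></div>
<p>Because <code>model.pkl</code> is declared under <code>outs</code>, running <code>dvc repro</code> and
then <code>dvc push</code> should upload it together with the rest of the pipeline
results, with no separate <code>dvc add</code> step.</p>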
<p><img src="https://media.giphy.com/media/l0LEIXSRRuv9QQIRNI/giphy.gif" alt="Tim Robinson Reaction GIF by The Lonely Island"></p>https://dvc.org/blog/january-21-dvc-heartbeathttps://dvc.org/blog/january-21-dvc-heartbeatWed, 20 Jan 2021 00:00:00 GMT<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Welcome to the first Heartbeat of 2021! Here's some new year news.</p>
<h3 id="were-still-hiring" style="position:relative;">We're still hiring<a href="#were-still-hiring" aria-label="were still hiring permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Our search continues for a
<a href="https://weworkremotely.com/remote-jobs/iterative-developer-advocate" target="_blank" rel="nofollow noopener noreferrer"><strong>Developer Advocate</strong></a>
to support and inspire developers by creating new content like blogs, tutorials,
and videos- plus lead outreach through meetups and conferences.</p>
<p>Does this sound like you or someone you know? Be in touch!</p>
<h3 id="7000-stars-on-github" style="position:relative;">7000 stars on GitHub<a href="#7000-stars-on-github" aria-label="7000 stars on github permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We recently passed 7000 stars on the
<a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">DVC GitHub repository</a>! We crossed the 7k
mark extremely close to midnight on New Year's Eve, so we probably hit it in
time for the new year in at least one time zone. Anyway, it made for a very
suspenseful countdown to midnight. Woot woot!</p>
<p><img src="https://media.giphy.com/media/QAPFLCrpfalPi/giphy.gif" alt="Make Countdown GIF"></p>
<p>The repo is HQ for DVC development, meaning- if you have an issue to report, a
feature to request, or a pull request to offer, this is where you should start!</p>
<h3 id="new-video-for-r-users" style="position:relative;">New video for R users<a href="#new-video-for-r-users" aria-label="new video for r users permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>A lot of our videos about GitHub Actions have used Python scripts, but there's
no reason to restrict <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">Continuous Machine Learning</a> to one
language. We've just released our first-ever R language video, which covers</p>
<ul>
<li>How to install R on a GitHub Actions runner</li>
<li>How to manage R package dependencies for continuous integration (teaser: CRAN
binaries are amazing)</li>
<li>Putting a <code>ggplot</code> or a <code>kable</code> table in your pull request</li>
</ul>
<p>Watch and follow along! If you make something based on this approach, or if you
think there's a better way, please tell us- we're eager to see what the R
community thinks.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/NwUijrm2U2w?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h3 id="workshops-and-talks" style="position:relative;">Workshops and talks<a href="#workshops-and-talks" aria-label="workshops and talks permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>On Friday, January 24, I (Elle) spoke with
<a href="https://twitter.com/Al_Grigor" target="_blank" rel="nofollow noopener noreferrer">Alexey Grigorev</a> (author of
<a href="https://mlbookcamp.com/" target="_blank" rel="nofollow noopener noreferrer">Machine Learning Bookcamp</a>) on his podcast about being a
developer advocate in the machine learning space! If you're curious about what
the role entails, or what to look for when hiring a developer advocate for your
machine learning project, please come by. The event is up on YouTube, and will
soon be available as a podcast for your listening pleasure 🎧</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/jv5W4jXk4P4?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>As ever, we have much to share from the great citizens of the DVC community.</p>
<h3 id="wheres-baby-yoda" style="position:relative;">Where's Baby Yoda?<a href="#wheres-baby-yoda" aria-label="wheres baby yoda permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There's a brand new blog post we love, and only half of that has to do with its
impressive collection of Baby Yoda pics.
<a href="https://dagshub.com/blog/author/simon/" target="_blank" rel="nofollow noopener noreferrer">Simon Lousky</a>, developer at
<a href="https://dagshub.com" target="_blank" rel="nofollow noopener noreferrer">DAGsHub</a>, published a blog provocatively titled
<a href="https://dagshub.com/blog/datasets-should-behave-like-git-repositories/" target="_blank" rel="nofollow noopener noreferrer"><em>Datasets should behave like git repositories</em></a>.
He writes:</p>
<blockquote>
<p>While data versioning solves the problem of managing data in the context of
your machine learning project, it brings with it a new approach to managing
datasets. This approach, also described as data registries here, consists of
creating a git repository entirely dedicated to managing a dataset. This means
that instead of training models on frozen datasets - something researchers,
students, kagglers, and open source machine learning contributors often do -
you could link your project to a dataset (or to any file for that matter), and
treat it as a dependency. After all, data can and should be treated as code,
and follow through a review process.</p>
</blockquote>
<p>We agree! Lousky goes on to show us a brilliant code example wherein he segments
instances of Baby Yoda out of frames from The Mandalorian. DVC plays a key role
in keeping track of all the Baby Yodas, which is pretty much the most important
use case we could've imagined.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 480px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/291a8f82c6d13846fb7a83a13386b1b6/39600/bb_yoda.png" alt="bb yoda" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Found them!</em></p>
<p>There's also a
<a href="https://www.reddit.com/r/MachineLearning/comments/l0l0oc/p_datasets_should_behave_like_git_repositories/" target="_blank" rel="nofollow noopener noreferrer">lively discussion about the post on Reddit</a>.
Check it out and consider contributing your own Baby Yoda image annotations to
grow the dataset!</p>
<h3 id="data-version-control-explained" style="position:relative;">Data Version Control Explained<a href="#data-version-control-explained" aria-label="data version control explained permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Researcher <a href="https://blog.crowdbotics.com/author/nimra/" target="_blank" rel="nofollow noopener noreferrer">Nimra Ejaz</a> published a
fantastically detailed introduction to DVC. She even included a "History of DVC"
section, which is pretty cool for us- this might be a first!</p>
<p>Her blog covers not only the key features of DVC, but a thoughtful pros-and-cons
list <em>and</em> a case study about using DVC in an image classification project. If
you want an up-to-date, high-level overview of DVC and some help deciding if it
fits your needs, I couldn't recommend Nimra's blog more.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://blog.crowdbotics.com/data-version-control-explained/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Data Version Control Explained</h4>
<div class="elp-description">Nimra Ejaz</div>
<div class="elp-link">crowdbotics.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-01-20/crowdbotics-cd9021f03aa5ede1fbe280a356617516.png" alt="Data Version Control Explained">
</div>
</a>
</section>
<p></p>
<h3 id="one-more-thing-from-dagshub" style="position:relative;">One more thing from DAGsHub<a href="#one-more-thing-from-dagshub" aria-label="one more thing from dagshub permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://twitter.com/DeanPlbn" target="_blank" rel="nofollow noopener noreferrer">Dean Pleban</a>, CEO of DAGsHub, shared an important
update: they now offer FREE dataset and model hosting for DVC projects (up to 10
GB per user and project, with flexibility for public projects)! And with no
configuration!</p>
<p>That means you don't have to configure your DVC remote to use DVC with model and
data storage in the cloud- DAGsHub will handle <em>all</em> of it. Your DVC remote can
be added as easily as a Git remote, in other words. Read the announcement, and
then dig into their
<a href="https://dagshub.com/docs/experiment-tutorial/overview/" target="_blank" rel="nofollow noopener noreferrer">basic tutorial</a> to get
started.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://dagshub.com/blog/dagshub-storage-zero-configuration-dataset-model-hosting/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Free Dataset & Model Hosting with Zero Configuration – Launching DAGsHub Storage</h4>
<div class="elp-description">Dean Pleban</div>
<div class="elp-link">dagshub.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2021-01-20/dagshub-aa036fbcd9874d7c399ca6ef36cfc846.jpg" alt="Free Dataset & Model Hosting with Zero Configuration – Launching DAGsHub Storage">
</div>
</a>
</section>
<p></p>
<h3 id="a-nice-tweet" style="position:relative;">A nice tweet<a href="#a-nice-tweet" aria-label="a nice tweet permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://twitter.com/bibryam" target="_blank" rel="nofollow noopener noreferrer">Bilgin Ibryam</a>, author of the
<a href="https://www.redhat.com/en/engage/kubernetes-containers-architecture-s-201910240918" target="_blank" rel="nofollow noopener noreferrer">Kubernetes Patterns</a>
book, gave us a shoutout for being an interesting data engineering project
(according to a list by another expert we trust,
<a href="https://twitter.com/squarecog" target="_blank" rel="nofollow noopener noreferrer">Dmitriy Ryaboy</a>). Thanks Bilgin and Dmitriy, we
think you're very interesting too!</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Five Interesting Data Engineering Projects (<a href="https://twitter.com/getdbt">@getdbt</a>, <a href="https://twitter.com/PrefectIO">@PrefectIO</a>, <a href="https://twitter.com/dask_dev">@dask_dev</a>, <a href="https://twitter.com/DVCorg">@DVCorg</a>, greatexpectations)<a href="https://t.co/XXeLXYDp0M">https://t.co/XXeLXYDp0M</a> by <a href="https://twitter.com/squarecog">@squarecog</a></p>— Bilgin Ibryam (@bibryam) <a href="https://twitter.com/bibryam/status/1341777034448650242">December 23, 2020</a></blockquote>https://dvc.org/blog/december-20-community-gemshttps://dvc.org/blog/december-20-community-gemsWed, 30 Dec 2020 00:00:00 GMT<h2 id="dvc-questions" style="position:relative;">DVC questions<a href="#dvc-questions" aria-label="dvc questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="q-is-there-a-way-to-plot-all-columns-in-a-csv-file-on-a-single-graph-using-dvc-plot" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/768689062314770442" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to plot all columns in a <code>.csv</code> file on a single graph using <code>dvc plot</code>?</a><a href="#q-is-there-a-way-to-plot-all-columns-in-a-csv-file-on-a-single-graph-using-dvc-plot" aria-label="q is there a way to plot all columns in a csv file on a single graph using dvc plot permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>By default, <code>dvc plots show</code> graphs one or two columns from the metrics file of your
choice (use the <code>-x</code> and <code>-y</code> flags to specify which columns).</p>
<p>However, there's nothing special about the way DVC makes plots. The plot
function is a wrapper for the <a href="https://vega.github.io/vega-lite-v1/" target="_blank" rel="nofollow noopener noreferrer">Vega-Lite</a>
grammar, which can make pretty much any kind of plot you can imagine. If you
check inside <code>.dvc/plots/</code>, you'll see a few Vega-Lite template files- that's
where the plotting instructions are stored!</p>
<p>You can create your own, or modify the existing templates, by
<a href="https://dvc.org/doc/command-reference/plots#plot-templates" target="_blank" rel="nofollow noopener noreferrer">following the instructions in our docs</a>.
In short, you'll create a new template and then run
<code>dvc plots show -t <name-of-template></code> to use it!</p>
<p>Vega-Lite has an
<a href="https://vega.github.io/editor/#/" target="_blank" rel="nofollow noopener noreferrer">interactive template editor online</a>, which
might help you test out ideas. Happy creating, and if you come up with a
template you'd like to share with the DVC community,
<a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">consider opening a pull request!</a></p>
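<p>To make the original question concrete, here is a rough sketch of a custom
template that overlays several columns on one chart. It relies on Vega-Lite's
<code>fold</code> transform. The column names <code>train_loss</code> and <code>val_loss</code>, the
<code>epoch</code> x-axis field and the template name <code>multi</code> are assumptions made
for this example, and the anchor names should be checked against the plot
template docs for your DVC version:</p>
<div class="gatsby-highlight" data-language="shell"><pre class="language-shell"><code class="language-shell"># Sketch only: column names and the template name are hypothetical.
mkdir -p .dvc/plots
cat > .dvc/plots/multi.json <<'EOF'
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "data": {"values": "<DVC_METRIC_DATA>"},
  "title": "<DVC_METRIC_TITLE>",
  "transform": [
    {"fold": ["train_loss", "val_loss"], "as": ["metric", "value"]}
  ],
  "mark": "line",
  "encoding": {
    "x": {"field": "epoch", "type": "quantitative"},
    "y": {"field": "value", "type": "quantitative"},
    "color": {"field": "metric", "type": "nominal"}
  }
}
EOF</code></pre></div>
<p>With the file in place, something like <code>dvc plots show -t multi logs.csv</code>
(where <code>logs.csv</code> stands in for your own metrics file) should draw one line
per folded column, colored by column name.</p>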
<h3 id="q-my-teammate-and-i-are-having-some-issues-keeping-our-workplaces-synced-were-tracking-some-folders-with-dvc-and-he-recently-added-a-new-file-to-each-of-these-folders-how-does-he-update-the-tracked-folder-and-push-the-new-contents-so-i-can-access-them-too" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/785965719367843860" target="_blank" rel="nofollow noopener noreferrer">Q: My teammate and I are having some issues keeping our workplaces synced. We're tracking some folders with DVC, and he recently added a new file to each of these folders. How does he update the tracked folder and push the new contents so I can access them, too?</a><a href="#q-my-teammate-and-i-are-having-some-issues-keeping-our-workplaces-synced-were-tracking-some-folders-with-dvc-and-he-recently-added-a-new-file-to-each-of-these-folders-how-does-he-update-the-tracked-folder-and-push-the-new-contents-so-i-can-access-them-too" aria-label="q my teammate and i are having some issues keeping our workplaces synced were tracking some folders with dvc and he recently added a new file to each of these folders how does he update the tracked folder and push the new contents so i can access them too permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Your partner should first run</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc add</span> <span class="token operator"><</span>folder<span class="token operator">></span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span></span></code></pre></div>
<p>to tell DVC about the new files and then push their contents to remote storage.
Next, they'll run:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span> <span class="token operator"><</span>folder<span class="token operator">></span>.dvc
</span><span class="token line"><span class="token input">$ </span><span class="token git">git push</span></span></code></pre></div>
<p>to update your shared Git repository. Then you can do a <code>git pull</code> and
<a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> to sync the changes with your local workspace!</p>
<h3 id="q-i-forgot-to-declare-a-metric-output-in-my-dvcyaml-file-so-one-of-my-metrics-is-currently-untracked-how-can-i-fix-this-without-rerunning-the-stage-it-takes-a-long-time-to-run" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/781643749050155009" target="_blank" rel="nofollow noopener noreferrer">Q: I forgot to declare a metric output in my <code>dvc.yaml</code> file, so one of my metrics is currently untracked. How can I fix this without rerunning the stage? It takes a long time to run.</a><a href="#q-i-forgot-to-declare-a-metric-output-in-my-dvcyaml-file-so-one-of-my-metrics-is-currently-untracked-how-can-i-fix-this-without-rerunning-the-stage-it-takes-a-long-time-to-run" aria-label="q i forgot to declare a metric output in my dvcyaml file so one of my metrics is currently untracked how can i fix this without rerunning the stage it takes a long time to run permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>No problem- what you'll want to do is edit your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file and then run
<a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit dvc.yaml</code></a> to store the change.</p>
<p><a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> is a helpful command that updates your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a> file and <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a>
files as needed, which forces DVC to accept any modifications to tracked data
currently in your workspace. That should cover the case where you have a metric
file from your last pipeline run in your workspace, but forgot to add it to the
<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> as an output!</p>
<p><a href="https://dvc.org/doc/command-reference/commit#commit" target="_blank" rel="nofollow noopener noreferrer">Check out the docs</a> for
more about <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> and how it can help you edit pipeline dependencies as
you work.</p>
<h3 id="q-can-i-have-multiple-dvcyaml-files" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/784083794583486496" target="_blank" rel="nofollow noopener noreferrer">Q: Can I have multiple <code>dvc.yaml</code> files?</a><a href="#q-can-i-have-multiple-dvcyaml-files" aria-label="q can i have multiple dvcyaml files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes. The catch is that they have to be in separate directories. For example, you
can define independent pipelines, each in its own <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. It's also possible
to spread a single pipeline across more than one <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. DVC analyzes all
of them to rebuild the DAG(s), for example during <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>.</p>
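<p>A minimal sketch of such a layout, assuming two hypothetical directories
<code>prepare/</code> and <code>train/</code> whose stage, script and file names are invented for
illustration. The second stage depends on an output of the first, so together
they form one pipeline spread across two files:</p>
<div class="gatsby-highlight" data-language="shell"><pre class="language-shell"><code class="language-shell"># Sketch only: directory, stage and file names are hypothetical.
mkdir -p prepare train

cat > prepare/dvc.yaml <<'EOF'
stages:
  prepare:
    cmd: python prepare.py
    deps:
      - prepare.py
    outs:
      - clean.csv
EOF

# Paths are relative to the directory that holds each dvc.yaml.
cat > train/dvc.yaml <<'EOF'
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      - ../prepare/clean.csv
    outs:
      - model.pkl
EOF</code></pre></div>
<p>From the project root, <code>dvc repro train/dvc.yaml</code> should then run the
<code>prepare</code> stage first whenever <code>clean.csv</code> is missing or out of date.</p>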
<h3 id="q-i-want-to-work-on-my-dvc-pipeline-on-a-different-computer-than-usual-for-the-stage-im-developing-i-dont-need-access-to-all-the-data-dependencies-of-the-earlier-stages--is-there-a-way-to-download-only-what-i-need" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/788068487246512158" target="_blank" rel="nofollow noopener noreferrer">Q: I want to work on my DVC pipeline on a different computer than usual. For the stage I'm developing, I don't need access to all the data dependencies of the earlier stages- is there a way to download only what I need?</a><a href="#q-i-want-to-work-on-my-dvc-pipeline-on-a-different-computer-than-usual-for-the-stage-im-developing-i-dont-need-access-to-all-the-data-dependencies-of-the-earlier-stages--is-there-a-way-to-download-only-what-i-need" aria-label="q i want to work on my dvc pipeline on a different computer than usual for the stage im developing i dont need access to all the data dependencies of the earlier stages is there a way to download only what i need permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Say for example that you have a pipeline like this:</p>
<div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">+----------+
| data.dvc |
+----------+
*
*
*
+----+
| s1 |
+----+
*
*
*
+----+
| s2 |
+----+
*
*
*
+----+
| s3 |
+----+</code></pre></div>
<p>where stage <code>s2</code> is frozen (meaning, its dependencies will not change and we can
be reasonably sure the outputs of <code>s2</code> are static).</p>
<p>To work on stage <code>s3</code> in a new workspace, you could run:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> s2
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> s3</span></code></pre></div>
<p>This set of commands will pull only the outputs of the targeted stage (not the data
corresponding to <code>data.dvc</code>), and then execute only the final stage of your
pipeline.</p>
<h2 id="cml-questions" style="position:relative;">CML questions<a href="#cml-questions" aria-label="cml questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="q-why-do-you-need-docker-to-run-cml" style="position:relative;"><a href="https://www.youtube.com/watch?v=rVq-SCNyxVc&lc=UgzohiMVxO1GKB30bad4AaABAg" target="_blank" rel="nofollow noopener noreferrer">Q: Why do you need Docker to run CML?</a><a href="#q-why-do-you-need-docker-to-run-cml" aria-label="q why do you need docker to run cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Even though we use Docker in many of our tutorials, you technically <em>don't</em> need
it at all! Here's what's going on:</p>
<p>We use a custom Docker container that comes with the CML functions installed (as
well as some useful data science tools like Python, Vega-Lite, and CUDA
drivers). If you want to use your own Docker container, that's fine too- just
make sure you install the CML library of functions on your runner.</p>
<p>To install CML as an <code>npm</code> package on your runner, we recommend:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">npm i -g @dvcorg/cml</code></pre></div>
<p>Once this is done, you should be able to execute functions like <code>cml publish</code>
and <code>cml send-comment</code> on your runner.</p>
<p>For more tips about using CML without Docker,
<a href="https://github.com/iterative/cml#install-cml-as-a-package" target="_blank" rel="nofollow noopener noreferrer">see our docs</a>.</p>
<h3 id="q-im-using-cml-to-print-a-dvc-metrics-diff-to-my-pull-request-in-github-but-im-getting-an-error-token-not-found-what-does-that-mean" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/786382971706933258" target="_blank" rel="nofollow noopener noreferrer">Q: I'm using CML to print a <code>dvc metrics diff</code> to my pull request in GitHub, but I'm getting an error: <code>token not found</code>. What does that mean?</a><a href="#q-im-using-cml-to-print-a-dvc-metrics-diff-to-my-pull-request-in-github-but-im-getting-an-error-token-not-found-what-does-that-mean" aria-label="q im using cml to print a dvc metrics diff to my pull request in github but im getting an error token not found what does that mean permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Generally, <code>token</code> refers to an authorization token that grants your runner
certain permissions with the GitHub API- such as the ability to post a comment
on your pull request. If you're working in GitHub, you don't have to follow any
manual steps to create a token. But you <em>do</em> need to make sure your
environment variables in the workflow are named properly.</p>
<p>Make sure you've specified the following field in your workflow file:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">env</span><span class="token punctuation">:</span>
<span class="token key atrule">repo_token</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.GITHUB_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span></code></pre></div>
<p>The variable must be called <code>repo_token</code> for CML to recognize it!</p>
<p>A few other pointers:</p>
<ul>
<li>In GitLab, you have to set a variable in your repository called <code>repo_token</code>
whose value is a Personal Access Token. We have
<a href="https://github.com/iterative/cml/wiki/CML-with-GitLab#variables" target="_blank" rel="nofollow noopener noreferrer">step-by-step instructions in our docs</a>.
Forgetting to set this is the #1 issue we see with first-time GitLab CI users!</li>
<li>In BitBucket Cloud, you need to set a variable in your repository called
<code>repo_token</code> whose value is your API credentials. We have
<a href="https://github.com/iterative/cml/wiki/CML-with-Bitbucket-Cloud#repository-variables" target="_blank" rel="nofollow noopener noreferrer">detailed docs for creating this token</a>,
too.</li>
<li>Need to see more sample workflows to get a feel for it? We have plenty
<a href="https://dvc.org/doc/cml#case-studies" target="_blank" rel="nofollow noopener noreferrer">of case studies</a> to examine.</li>
</ul>
<h3 id="q-is-there-any-reason-why-an-experimental-dvc-feature-wouldnt-work-on-the-cml-docker-container" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/788512890394247178" target="_blank" rel="nofollow noopener noreferrer">Q: Is there any reason why an experimental DVC feature wouldn't work on the CML Docker container?</a><a href="#q-is-there-any-reason-why-an-experimental-dvc-feature-wouldnt-work-on-the-cml-docker-container" aria-label="q is there any reason why an experimental dvc feature wouldnt work on the cml docker container permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Generally, no- the container <code>dvcorg/cml:latest</code> should have the latest DVC
release and the latest CML release (you can see where DVC and CML are installed
from in our
<a href="https://github.com/iterative/cml/blob/master/Dockerfile" target="_blank" rel="nofollow noopener noreferrer">Dockerfile</a>). So
besides the time it takes for releases to be published on various package
managers, there shouldn't be any lag. That means experimental features are ready
to play on your runner!</p>
<p>Note that you can also install pre-release versions of DVC- check out our
<a href="https://dvc.org/doc/install/pre-release" target="_blank" rel="nofollow noopener noreferrer">docs about installing the latest stable version ahead of official releases</a>.</p>https://dvc.org/blog/december-20-dvc-heartbeathttps://dvc.org/blog/december-20-dvc-heartbeatFri, 18 Dec 2020 00:00:00 GMT<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Welcome to the December Heartbeat! Let's dive in with some news from the team.</p>
<h3 id="were-still-hiring" style="position:relative;">We're still hiring<a href="#were-still-hiring" aria-label="were still hiring permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Our search continues for two roles:</p>
<ul>
<li>
<p>A
<a href="https://weworkremotely.com/remote-jobs/iterative-senior-software-engineer-open-source-dev-tools-3" target="_blank" rel="nofollow noopener noreferrer"><strong>Senior Software Engineer</strong></a>
for the core DVC team- someone with strong Python development skills who can
build and ship essential DVC features.</p>
</li>
<li>
<p>A
<a href="https://weworkremotely.com/remote-jobs/iterative-developer-advocate" target="_blank" rel="nofollow noopener noreferrer"><strong>Developer Advocate</strong></a>
to support and inspire developers by creating new content like blogs,
tutorials, and videos- plus lead outreach through meetups and conferences.</p>
</li>
</ul>
<p>Does this sound like you or someone you know? Be in touch!</p>
<h3 id="video-docs-complete" style="position:relative;">Video docs complete!<a href="#video-docs-complete" aria-label="video docs complete permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>As you may have heard
<a href="https://dvc.org/blog/november-20-dvc-heartbeat" target="_blank" rel="nofollow noopener noreferrer">last month</a>, we've been working
on adding complete video docs to the "Getting Started" section of the DVC site.
We now have 100% coverage! We have videos that mirror the tutorials for:</p>
<ul>
<li>
<p><a href="https://dvc.org/doc/start/data-and-model-versioning" target="_blank" rel="nofollow noopener noreferrer">Data versioning</a> - how
to use Git and DVC together to track different versions of a dataset</p>
</li>
<li>
<p><a href="https://dvc.org/doc/start/data-and-model-access" target="_blank" rel="nofollow noopener noreferrer">Data access</a> - how to share
models and datasets across projects and environments</p>
</li>
<li>
<p><a href="https://dvc.org/doc/start/data-pipelines" target="_blank" rel="nofollow noopener noreferrer">Pipelines</a> - how to create
reproducible pipelines to transform datasets to features to models</p>
</li>
<li>
<p><a href="https://dvc.org/doc/start/experiments" target="_blank" rel="nofollow noopener noreferrer">Experiments</a> - how to do a <code>git diff</code>
for models that compares and visualizes metrics</p>
</li>
</ul>
<p><img src="https://media.giphy.com/media/L4ZZNbDpOCfiX8uYSd/giphy.gif" alt="Mission Accomplished GIF by memecandy"></p>
<p>The
<a href="https://www.youtube.com/playlist?list=PL7WG7YrwYcnDb0qdPl9-KEStsL-3oaEjg" target="_blank" rel="nofollow noopener noreferrer">full playlist is on our YouTube channel</a>-
where, by the way, we've recently passed 2,000 subscribers! Thanks so much for
your support. There's much more coming up soon.</p>
<h3 id="collaboration-with-gitlab" style="position:relative;">Collaboration with GitLab<a href="#collaboration-with-gitlab" aria-label="collaboration with gitlab permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We recently released a new blog post with GitLab all about using <a href="https://cml.dev">CML</a> with
GitLab CI.</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">The team behind <a href="https://t.co/At942BC7sF">https://t.co/At942BC7sF</a> released an open source project called CML (continuous machine learning). <br><br>Learn more about GitLab ➕ <a href="https://twitter.com/DVCorg">@DVCorg</a>! <a href="https://t.co/eD8loo4mT5">https://t.co/eD8loo4mT5</a></p>— 🦊 GitLab (@gitlab) <a href="https://twitter.com/gitlab/status/1334631001956487171">December 3, 2020</a></blockquote>
<p>You may notice that the tweet spelled our name differently, and since Twitter
doesn't have an edit button, I think that means we're "Interative" now.
<a href="https://www.zazzle.com/t_shirt-235920696568133954" target="_blank" rel="nofollow noopener noreferrer">Hurry up and get your merch!</a></p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 536px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/3e2ee29409886ff96de8060077295dcd/39600/newname.png" alt="newname" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h3 id="workshops" style="position:relative;">Workshops<a href="#workshops" aria-label="workshops permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We gave a workshop at a virtual meetup held by the
<a href="https://mlopsworld.com/about-us/" target="_blank" rel="nofollow noopener noreferrer">Toronto Machine Learning Society</a>, and you
can catch a video recording if you missed it. This workshop was all about
getting started with GitHub Actions and CML! It starts with some high-level
overview and then gets into live-coding.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/51H13lfHdMw?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>There's no shortage of cool things to report from the community:</p>
<h3 id="the-dvc-udemy-course" style="position:relative;">The DVC Udemy Course<a href="#the-dvc-udemy-course" aria-label="the dvc udemy course permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Now you can learn the fundamentals of machine learning engineering, from
experiment tracking to data management to continuous integration, with DVC and
Udemy! Data scientists/DVC ambassadors
<a href="https://www.udemy.com/user/mnrozhkov/" target="_blank" rel="nofollow noopener noreferrer">Mikhail Rozhkov</a> and
<a href="https://www.udemy.com/user/marcel-da-camara-ribeiro-dantas/" target="_blank" rel="nofollow noopener noreferrer">Marcel Ribeiro-Dantas</a>
created a course full of
<a href="https://www.udemy.com/course/machine-learning-experiments-and-engineering-with-dvc/?referralCode=68BEB2A7E246A54E5E35" target="_blank" rel="nofollow noopener noreferrer">practical tips and tricks for learners of all levels</a>.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.udemy.com/course/machine-learning-experiments-and-engineering-with-dvc/?referralCode=68BEB2A7E246A54E5E35" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Machine Learning Experiments and Engineering with DVC</h4>
<div class="elp-description">Automate machine learning experiments, pipelines and model deployment (CI/CD, MLOps) with Data Version Control (DVC).</div>
<div class="elp-link">udemy.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-12-18/udemy-90fceb9dfeae3078b718199d02bfd2d3.png" alt="Machine Learning Experiments and Engineering with DVC">
</div>
</a>
</section>
<p></p>
<h3 id="a-proposal-for-git-flow-with-dvc" style="position:relative;">A proposal for Git-flow with DVC<a href="#a-proposal-for-git-flow-with-dvc" aria-label="a proposal for git flow with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://www.uni-augsburg.de/en/fakultaet/fai/informatik/prof/swtpvs/team/fabian-rabe/" target="_blank" rel="nofollow noopener noreferrer">Fabian Rabe</a>
at <a href="https://www.uni-augsburg.de/en/" target="_blank" rel="nofollow noopener noreferrer">Universität Augsburg</a> wrote a killer doc
about his team's tried-and-true approach to creating a workflow for a DVC
project. He writes,</p>
<blockquote>
<p>Over the past couple of months we have started using DVC in our small team.
With a handful of developers all coding, training models & committing in the
same repository, we soon realized the need for a workflow.</p>
</blockquote>
<p>The post outlines three strategies his team adopted:</p>
<ol>
<li>
<p>Create a "debugging dataset" containing a subset of your data, with which you
can test your complete DVC pipeline locally on a developer's machine</p>
</li>
<li>
<p>Use CI-Runners to execute the DVC pipeline on the full dataset</p>
</li>
<li>
<p>Adopt a naming convention for Git branches that correspond to machine
learning experiments, in addition to the usual feature branches</p>
</li>
</ol>
<p>Agree? Disagree? Fabian is actively soliciting feedback on his proposal (and
possible solutions for some unresolved issues), so please read and
<a href="https://discuss.dvc.org/t/git-flow-for-dvc/578/6" target="_blank" rel="nofollow noopener noreferrer">chime in on our discussion board</a>.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://git.rz.uni-augsburg.de/rabefabi/git-flow-for-dvc" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Git Flow for DVC</h4>
<div class="elp-description">Fabian Rabe</div>
<div class="elp-link">git.rz.uni-augsburg.de</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-12-18/universitat_augs-72cc857d548d5f6bae11cf544b62c097.jpg" alt="Git Flow for DVC">
</div>
</a>
</section>
<p></p>
<h3 id="channel-9-talks-machine-learning-and-python" style="position:relative;">Channel 9 talks Machine Learning and Python<a href="#channel-9-talks-machine-learning-and-python" aria-label="channel 9 talks machine learning and python permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://channel9.msdn.com/Shows/AI-Show" target="_blank" rel="nofollow noopener noreferrer">The AI Show on Channel 9</a>, part of the
Microsoft DevRel universe, put out an episode all about ML and scientific
computing with Python featuring <a href="https://twitter.com/ixek" target="_blank" rel="nofollow noopener noreferrer">Tania Allard</a> and
<a href="https://twitter.com/sethjuarez" target="_blank" rel="nofollow noopener noreferrer">Seth Juarez</a>. Their episode includes how DVC
can fit in this development toolkit, so check it out!</p>
<div class="gatsby-resp-iframe-wrapper" style="padding-bottom: 56.25%; position: relative; height: 0; overflow: hidden; "> <iframe src="https://channel9.msdn.com/Shows/AI-Show/Machine-Learning-and-Scientific-Computing-with-Python/player" allowfullscreen frameborder="0" title="Machine Learning and Scientific Computing with Python - Microsoft Channel 9 Video" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> </div>
<h3 id="a-nice-tweet" style="position:relative;">A nice tweet<a href="#a-nice-tweet" aria-label="a nice tweet permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We'll end on a tweet we love:</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">I learned quite a bit in <a href="https://twitter.com/visenger">@visenger</a>'s talk about 10 fundamental practices for Machine Learning engineering. <br><br>Here is my <a href="https://twitter.com/hashtag/sketchnote?src=hash&ref_src=twsrc%5Etfw">#sketchnote</a> <a href="https://twitter.com/hashtag/INNOQTechnologyDay?src=hash&ref_src=twsrc%5Etfw">#INNOQTechnologyDay</a> <a href="https://t.co/tQjRrJq993">pic.twitter.com/tQjRrJq993</a></p>— Joy Heron (@iamjoyheron) <a href="https://twitter.com/iamjoyheron/status/1336698583689596929">December 9, 2020</a></blockquote>
<p>This beautiful diagram, made by <a href="https://twitter.com/iamjoyheron" target="_blank" rel="nofollow noopener noreferrer">Joy Heron</a> in
response to a talk by <a href="https://twitter.com/visenger" target="_blank" rel="nofollow noopener noreferrer">Dr. Larysa Visengeriyeva</a>
about MLOps, is a wonderful encapsulation of the many considerations (at many
scales) that go into ML engineering. Do you see DVC in there? 🕵️</p>
<p>Thank you for reading, and happy holidays to you! ❄️ 🎁 ☃️</p>https://dvc.org/blog/dvc-vs-rclonehttps://dvc.org/blog/dvc-vs-rcloneThu, 26 Nov 2020 00:00:00 GMT<p>Many general-use tools are available for synchronizing data to and from cloud
storage; some widely used options are <a href="https://rsync.samba.org/" target="_blank" rel="nofollow noopener noreferrer">rsync</a>,
<a href="https://rclone.org/" target="_blank" rel="nofollow noopener noreferrer">rclone</a> and
<a href="https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html" target="_blank" rel="nofollow noopener noreferrer">aws s3 sync</a>, each
with its own advantages and disadvantages. Likewise, in <a href="https://dvc.org/">DVC</a> we provide
the ability to efficiently sync versioned datasets to and from cloud storage
through a git-like push and pull
<a href="https://dvc.org/doc/start/data-management/data-versioning" target="_blank" rel="nofollow noopener noreferrer">interface</a>.</p>
<p>Given that transferring data over a network to and from cloud storage is an
inherently slow operation, it's important for data sync tools to optimize
performance wherever possible. While the data transfer itself may be the most
apparent performance bottleneck in the data sync process, <strong>here we'll cover a
less obvious performance issue: How to determine which files to upload and
download.</strong></p>
<p>In this post, we'll outline the general methods used to solve this problem, and
investigate each method's effects on performance by comparing benchmark results
from DVC and rclone. We'll then conclude with a more in-depth explanation of new
optimizations made in DVC 1.0 which enabled us to outperform both older DVC
releases as well as general data sync tools (like rclone).</p>
<p><em>Note: "Cloud storage" and "remote storage" will be used interchangeably
throughout this post. When discussing dataset size in this post, we mean size in
terms of total number of files in a dataset, rather than the total amount of
file data (bytes).</em></p>
<h3 id="outline" style="position:relative;">Outline<a href="#outline" aria-label="outline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li><a href="#why-a-trivial-problem-has-a-not-so-trivial-performance-impact">Why a "trivial" problem has a not-so-trivial performance impact</a></li>
<li><a href="#real-world-numbers---dvc-and-rclone-performance-examples">Real-world numbers - DVC and rclone performance examples</a></li>
<li><a href="#how-dvc-10-speeds-things-up">How DVC 1.0 speeds things up</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<h2 id="why-a-trivial-problem-has-a-not-so-trivial-performance-impact" style="position:relative;">Why a "trivial" problem has a not-so-trivial performance impact<a href="#why-a-trivial-problem-has-a-not-so-trivial-performance-impact" aria-label="why a trivial problem has a not so trivial performance impact permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>At the start of any data sync operation, we must first do the following steps,
in order to determine which files to upload and download between the local
machine and cloud storage:</p>
<ol>
<li>Determine which files are present locally.</li>
<li>Query the cloud storage API to determine which files are present in the
cloud.</li>
<li>Compute the difference between the two sets of files.</li>
</ol>
<p>Once this difference in file status has been determined, the necessary files can
be copied to or from cloud storage as needed ("file status" meaning file
existence as well as other potential status information, such as modification
time). <strong>While this may seem like a trivial problem, the second step is actually
a significant potential performance bottleneck.</strong></p>
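<p>The three steps above can be sketched in a few lines of Python. This is only an illustration, not DVC's or rclone's implementation: the function name <code>plan_sync</code> and the file names are hypothetical, and the sketch compares file existence only, ignoring other status information such as modification time.</p>

```python
def plan_sync(local_files, remote_files):
    """Return (to_upload, to_download) as sorted lists of file names."""
    local = set(local_files)      # step 1: files present locally
    remote = set(remote_files)    # step 2: files present in the cloud
    # step 3: the difference between the two sets, in each direction
    to_upload = sorted(local - remote)
    to_download = sorted(remote - local)
    return to_upload, to_download


local_files = ["a.csv", "b.csv", "c.csv"]
remote_files = ["b.csv", "c.csv", "d.csv"]
to_upload, to_download = plan_sync(local_files, remote_files)
print(to_upload)    # ['a.csv']
print(to_download)  # ['d.csv']
```

<p>In this sketch the remote listing is simply handed in as a Python list. In a real sync tool, obtaining that listing is the expensive part, which is what the rest of this section is about.</p>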
<p>In general, cloud storage APIs provide two possible ways to determine what files
are present in cloud storage, and it's up to the data sync tool to select which
method to use. Even for an operation as simple as synchronizing a single local
file to cloud storage, choosing incorrectly between these two options could
actually mean the difference between that "simple" operation taking several
hours to complete instead of just a few seconds.</p>
<p><em>Note: The term "file status query" will be used throughout this post when
referring to this type of cloud storage API query.</em></p>
<h3 id="method-1-query-individual-files" style="position:relative;">Method 1: Query individual files<a href="#method-1-query-individual-files" aria-label="method 1 query individual files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The first query method is to individually check whether or not particular files
exist in cloud storage, one at a time.</p>
<p><em>Ex: The S3 API provides the <code>HeadObject</code> method.</em></p>
<p>When using this method, performance depends on the number of files being
queried: a single file takes a single API request, while 1 million files take
1 million API requests. The overall time needed to complete the full operation
therefore scales with the number of files to query.</p>
<p>One particular advantage to using this method is that it can be easily
parallelized. Overall runtime can be improved by making simultaneous API
requests to query for multiple files at once.</p>
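<p>As a rough sketch of how per-file queries can be parallelized, the Python example below runs several existence checks at once with a thread pool. The in-memory set <code>REMOTE_OBJECTS</code> and the function <code>object_exists</code> are hypothetical stand-ins for cloud storage and for a single-file API request, so nothing here touches a real cloud API.</p>

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for cloud storage: a set of object names.
REMOTE_OBJECTS = {"data/0001.bin", "data/0002.bin", "data/0004.bin"}


def object_exists(name):
    """Simulate one per-file status query (one API request per call)."""
    return name in REMOTE_OBJECTS


def query_individually(names, max_workers=4):
    """Check each file with its own request, several requests at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input names
        results = list(pool.map(object_exists, names))
    return dict(zip(names, results))


status = query_individually(["data/0001.bin", "data/0003.bin", "data/0004.bin"])
missing = [name for name, present in status.items() if not present]
print(missing)  # ['data/0003.bin']
```

<p>The total number of requests is still one per file. Parallelism only reduces the wall-clock time spent waiting for them.</p>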
<h3 id="method-2-query-full-remote-listing" style="position:relative;">Method 2: Query full remote listing<a href="#method-2-query-full-remote-listing" aria-label="method 2 query full remote listing permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The second query method is to request the full listing of files present in cloud
storage, all at once.</p>
<p><em>Ex: The S3 API provides the <code>ListObjects</code> method.</em></p>
<p>With this method, the overall amount of time it will take to complete the full
operation scales with the total number of files in cloud storage, rather than
the number of files we wish to query.</p>
<p>It's important to note that when using this method, cloud APIs will only return
a certain number of files at a time (the amount returned varies depending on the
API). This means that for an API which returns 1000 files at a time (such as
S3), retrieving the full listing of a remote containing 1000 files or fewer
would only take a single API request. Listing a remote which contains 1 million
files would take 1000 API requests.</p>
<p>Another important note is that API calls for this method must be made
sequentially and can't be easily parallelized. Using S3 as an example, the first
API call would return files 0 through 999. The next call would return files 1000
through 1999, and so on. However, the API gives us no way to jump straight to an
arbitrary page: each call continues from where the previous one left off, so
calls must be made one after another until the full list has been retrieved. This
means we can't make two simultaneous requests for both "files 0-999" and "files
1000-1999".</p>
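<p>The sequential paging described above can be sketched in a few lines of Python. This is only an illustration: <code>list_page</code> is a hypothetical stand-in for a paginated cloud listing call (it is not a real S3 or DVC function), and the integer token it returns is an assumed simplification of a continuation marker.</p>

```python
from typing import List, Optional, Tuple

PAGE_SIZE = 1000  # page size assumed in the text above


def list_page(keys: List[str], token: Optional[int]) -> Tuple[List[str], Optional[int]]:
    """Hypothetical stand-in for one paginated listing call.

    Returns one page of keys plus the token needed to request the next page,
    or None when the listing is exhausted.
    """
    start = token or 0
    end = start + PAGE_SIZE
    next_token = end if len(keys) > end else None
    return keys[start:end], next_token


def list_all(keys: List[str]) -> Tuple[List[str], int]:
    """Fetch the full listing one page at a time and count the calls made."""
    listing: List[str] = []
    calls = 0
    token: Optional[int] = None
    while True:
        # Each call needs the token from the previous call, so the
        # requests cannot be issued in parallel.
        page, token = list_page(keys, token)
        calls += 1
        listing.extend(page)
        if token is None:  # no more pages to fetch
            return listing, calls


remote = [f"file-{i:07d}" for i in range(2500)]
listing, calls = list_all(remote)
print(len(listing), calls)  # 2500 files retrieved in 3 sequential calls
```

<p>Because every request depends on the token from the one before it, the total number of calls grows with the size of the remote, no matter how few files we actually care about.</p>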
<h3 id="how-selecting-one-method-or-the-other-can-drastically-improve-performance" style="position:relative;">How selecting one method or the other can drastically improve performance<a href="#how-selecting-one-method-or-the-other-can-drastically-improve-performance" aria-label="how selecting one method or the other can drastically improve performance permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Consider an example scenario where a dataset being synchronized contains 100
local files, and we need to check which of those files exist in cloud storage.
For the purposes of this example, we'll also assume that all individual API
calls take the same amount of time to complete, and that we are not running any
tasks in parallel. Additionally, let's say that our example cloud storage API
returns 1000 files per page when using query method 2.</p>
<p>In this situation, we know that the first query method will always take a fixed
number of API calls to complete (100). The number of API calls required for the
second query method depends on the total number of files that already exist in
the remote.</p>
<p>Since we know that the API returns 1000 results per API call, we can say that if
the remote contains fewer than <code>1000 * 100 = 100,000</code> files, fetching the full
remote listing (method 2) will be faster than checking each file individually,
since it will take fewer than 100 API calls to complete. In the case that the
remote contains 1000 or fewer files, method 2 would only require a single API
call (potentially outperforming method 1 by 100x).</p>
<p>However, if the remote contains anything over this 100,000 threshold, method 1
will be faster than method 2, with the difference in performance between the two
methods scaling linearly as the potential remote size increases.</p>
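<p>The arithmetic above can be checked with a small Python sketch. It uses the simplified model from this example (every API call costs the same, nothing runs in parallel); the function names are made up for illustration, and breaking ties in favor of method 2 is an arbitrary choice.</p>

```python
import math


def calls_method_1(query_size: int) -> int:
    """Method 1: one existence check per file we want to query."""
    return query_size


def calls_method_2(remote_size: int, page_size: int = 1000) -> int:
    """Method 2: one call per page of the full remote listing (at least one)."""
    return max(1, math.ceil(remote_size / page_size))


def cheaper_method(query_size: int, remote_size: int, page_size: int = 1000) -> int:
    """Return 1 or 2, whichever needs fewer API calls (ties go to method 2)."""
    if calls_method_1(query_size) >= calls_method_2(remote_size, page_size):
        return 2
    return 1


for remote_size in (1_000, 50_000, 100_000, 1_000_000):
    print(
        remote_size,
        calls_method_1(100),
        calls_method_2(remote_size),
        cheaper_method(100, remote_size),
    )
```

<p>With 100 local files, method 2 needs 1, 50, 100 and 1000 calls for these four remote sizes, so the crossover sits at the 100,000-file mark.</p>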
<p><strong>Total API calls required to query 100 local files from S3</strong>
<img src="https://dvc.org/2020-11-26/api_calls_100_local-72e1167532070d287193c1edc06d31ec.svg" alt="API calls" title="API calls required to query 100 local files from S3"></p>
<p>This example illustrates an important point. Given a (relatively) small set of
files to query and a sufficiently large remote, method 1 will always be faster
than method 2.</p>
<p>Thinking about it from a different perspective, what happens if we have the
ability to reduce the size of a (relatively) large query set?</p>
<p>Once our query set is smaller than a certain threshold, we'll be able to use
method 1 rather than method 2. On top of that, we know that the runtime of
method 1 scales with query set size. <strong>In simple terms, by reducing the size of
our query set as much as possible, we can also improve performance.</strong></p>
<p>So, as we have shown, choosing the optimal method depends on both:</p>
<ul>
<li>The number of files that we need to query.</li>
<li>The total number of files in the remote.</li>
</ul>
<p><em>Note: In terms of real world performance, there are other considerations that
DVC must account for, such as different API calls taking different amounts of
time to complete, parallelization, and the amount of time it takes to run list
comparison operations in Python.</em></p>
<h2 id="real-world-numbers---dvc-and-rclone-performance-examples" style="position:relative;">Real-world numbers - DVC and rclone performance examples<a href="#real-world-numbers---dvc-and-rclone-performance-examples" aria-label="real world numbers dvc and rclone performance examples permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Now let's take a look at some real-world numbers to examine the impact selecting
one query method or the other has on data sync performance in DVC and rclone.
Both tools can utilize either potential query method, with some differences:</p>
<ul>
<li>In rclone, the user can specify the <code>--no-traverse</code> option to select the first
query method, otherwise rclone will default to the second method in most
situations (with the exception being cases with very small query set sizes).</li>
<li>In DVC prior to 1.0, the first query method would be used by default for all
supported cloud storage platforms except Google Drive, and the user could
specify one method or the other via the <code>no_traverse</code> configuration option.</li>
<li><strong>In DVC 1.0 and later, the optimal query method is selected automatically.</strong></li>
</ul>
<p>In the following scenarios, we are simulating the typical DVC use case in which
a user tracks a local directory containing some number of files using DVC, and
then synchronizes the DVC-tracked directory to cloud storage (S3 in these
examples) using either DVC or rclone. The user would then continually repeat a
process of:</p>
<ol>
<li>Modify a small subset of files in the directory.</li>
<li>Push the updated version of the directory into cloud storage.</li>
</ol>
<p>Keep in mind that for DVC's purposes, we are most interested in optimizing
performance for scenarios which are normally very slow to complete. If you
consider an operation which previously took several hours to complete, improving
that runtime down to a few minutes will have a much greater impact for our users
versus shaving a few seconds off of an operation which previously took under a
minute to run.</p>
<p><em>Note: For these benchmarks we are only interested in the amount of time
required to determine file status for this one-way push operation. So the
runtimes in each case are for status queries only (using <a href="https://dvc.org/doc/command-reference/status#-c"><code>dvc status -c</code></a> in DVC
and <code>rclone copy --dry-run</code> in rclone). No file data was transferred to or from
S3 in any of these scenarios.</em></p>
<p><em>Benchmark command usage:</em></p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">time</span> dvc status <span class="token parameter variable">-c</span> <span class="token parameter variable">-r</span> remote
</span><span class="token line"><span class="token input">$ </span><span class="token command">time</span> rclone copy <span class="token parameter variable">--dry-run</span> <span class="token parameter variable">--progress</span> <span class="token parameter variable">--exclude</span> <span class="token string">"**/**.unpacked/"</span> .dvc/cache remote:<span class="token punctuation">..</span>.</span></code></pre></div>
<p><em>rclone run with <code>--no-traverse</code> where indicated</em></p>
<p><em>Benchmark platform: Python 3.7, macOS Catalina, DVC installed from pip,
dual-core 3.1GHz i7 cpu</em></p>
<p><strong>Local directory w/100k total files, S3 bucket w/1M total files (1 file
modified since last sync)</strong>
<img src="https://dvc.org/2020-11-26/dvc_rclone_bench-71a153aa67b33f2de5c350dab7dbebd3.svg" alt="benchmarks" title="DVC 1.0 vs rclone performance comparison"></p>
<p>The previous chart contains benchmarks for a scenario in which the local
directory contains 100,000 files, and the S3 bucket contains approximately 1
million files. One file in the local directory has been modified since the
directory was last synchronized with the S3 bucket. This scenario tests the
length of time it takes DVC or rclone to determine (and report to the user) that
only the one modified file is missing from the S3 bucket and needs to be
uploaded.</p>
<p>This illustrates DVC's performance advantage over rclone with regard to
synchronizing iterations of a versioned dataset over time, as well as the DVC
1.0 performance improvements over prior releases.</p>
<p><em>Note: In these examples, the local file count refers to the number of files
inside the original tracked directory. The number of files present in the DVC
cache will differ slightly, since the DVC cache will contain an additional file
representing the tracked directory itself, but the end result is that DVC and
rclone will both need to query for the same number of files (i.e. the number
of files in the cache directory).</em></p>
<p><strong>Local directory w/1 file, S3 bucket w/1M total files</strong>
<img src="https://dvc.org/2020-11-26/dvc_rclone_bench2-1ec9a63c6674ee11a5147f15958608d8.svg" alt="benchmarks" title="DVC 1.0 vs rclone performance comparison"></p>
<p>In this example, we are testing a simple scenario in which the local directory
contains 1 file and the S3 bucket contains approximately 1 million files.</p>
<p>In this case, in DVC 0.91 we essentially get lucky that our default choice for
S3 happens to be the first query method. If we ran this same scenario with a
Google Drive remote (where the 0.91 default choice is the second query method)
instead of S3, we would see a very long runtime for DVC 0.91.</p>
<p>Also note that here, rclone is able to determine that with a single local file
to query, it should use the first query method instead of defaulting to the
second method.</p>
<p><em>Note: We are unsure of the reason for the rclone runtime difference with and
without <code>--no-traverse</code> for this scenario, but rclone does do some computation
to determine whether or not to default to <code>no-traverse</code> behavior for small query
sets. It's likely that specifying <code>--no-traverse</code> allows rclone to skip that
overhead entirely in this case.</em></p>
<p><strong>Local directory w/1M files, Empty S3 bucket</strong>
<img src="https://dvc.org/2020-11-26/dvc_rclone_bench3-ae6c58603cf1aa93382fcdcdbff9ec4b.svg" alt="benchmarks" title="DVC 1.0 vs rclone performance comparison">
<em>Note: DVC 0.91 and rclone with <code>--no-traverse</code> both take multiple hours to
complete in this scenario and continue off of the chart.</em></p>
<p>In this example, we are testing a simple scenario in which the local directory
contains approximately 1 million files and the S3 bucket is empty.</p>
<p>The difference in rclone runtime with or without <code>--no-traverse</code> in this
scenario shows the performance impact of selecting the optimal query method for
a given situation.</p>
<p>This scenario also shows that rclone can outperform DVC with regard to
collecting the list of local files during certain types of sync operations. In
this case, rclone simply iterates over whatever files exist in the local
directory without doing any additional steps, since our benchmark uses a one-way
<code>rclone copy</code> operation.</p>
<p>However, in DVC, we have some extra overhead for this step, since we collect the
list of files expected to be present in the current DVC repository revision, and
then verify that those files are present locally. We would then check to see if
any missing files are available to be downloaded from remote storage.</p>
<p>It should also be noted that in common use cases where the number of files in
cloud storage continues to grow over time (such as in backup solutions or in
dataset versioning), rclone's advantage in this case would only apply for this
initial sync operation. Once the local dataset has been pushed to cloud storage,
DVC's advantage in synchronizing modifications to existing datasets would become
more apparent (as shown in the first example).</p>
<h2 id="how-dvc-10-speeds-things-up" style="position:relative;">How DVC 1.0 speeds things up<a href="#how-dvc-10-speeds-things-up" aria-label="how dvc 10 speeds things up permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>So I hope that by now you're curious about DVC, and are planning on using (or
maybe even already are using 😀) it to sync your files. For those who are
wondering where the magic actually happens, let's dive a bit deeper into how DVC
stores files, and how we were able to leverage that storage format to implement
query performance optimizations in DVC 1.0. (This will also be a useful primer
for anyone interested in learning about DVC internals in general.)</p>
<p>Previously, we have established that:</p>
<ul>
<li>Selecting the right query method will have a significant performance impact.</li>
<li>Reducing the number of files to query will improve performance.</li>
</ul>
<p>In this section, we'll cover the ways in which DVC 1.0 has directly addressed
both of these key points:</p>
<ul>
<li>Automatically selecting the optimal query method for any given sync operation.</li>
<li>Indexing cloud storage remotes to eliminate the need to query for already
synchronized files.</li>
</ul>
<h3 id="dvc-storage-structure" style="position:relative;">DVC storage structure<a href="#dvc-storage-structure" aria-label="dvc storage structure permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Before continuing, it will be helpful for the reader to understand a few things
about the DVC cache and remote storage structure.</p>
<div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">.
├── 00
│ ├── 411460f7c92d2124a67ea0f4cb5f85
│ ├── 6f52e9102a8d3be2fe5614f42ba989
│ └── ...
├── 01
├── 02
├── 03
├── ...
└── ff</code></pre></div>
<p><em>Example DVC cache/remote structure</em></p>
<ul>
<li>Files versioned by DVC are identified and stored in subdirectories according
to their <a href="https://en.wikipedia.org/wiki/MD5" target="_blank" rel="nofollow noopener noreferrer">MD5</a> hash (i.e.
<a href="https://en.wikipedia.org/wiki/Content-addressable_storage" target="_blank" rel="nofollow noopener noreferrer">content addressable storage</a>).</li>
<li>MD5 hash values are
<a href="https://michiel.buddingh.eu/distribution-of-hash-values" target="_blank" rel="nofollow noopener noreferrer">evenly distributed</a>,
so the files in the DVC cache (and DVC remote storage) will be evenly
distributed across subdirectories (i.e. given a large enough dataset, each
remote subdirectory will contain an approximately equal number of files).</li>
</ul>
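<p>To make the layout concrete, here is a minimal Python sketch of how a hash maps to a location in the structure shown above. The helper name <code>cache_relpath</code> is hypothetical and this is not DVC's own code; it only mirrors the two-character subdirectory split visible in the example listing.</p>

```python
import hashlib


def cache_relpath(md5_hex: str) -> str:
    """The first two hex characters name the subdirectory, the rest the file."""
    return f"{md5_hex[:2]}/{md5_hex[2:]}"


digest = hashlib.md5(b"").hexdigest()  # MD5 of empty content
print(cache_relpath(digest))  # d4/1d8cd98f00b204e9800998ecf8427e
```

<p>Two hex characters give <code>16 * 16 = 256</code> possible subdirectories, <code>00</code> through <code>ff</code>.</p>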
<h3 id="how-dvc-10-automatically-selects-a-query-method" style="position:relative;">How DVC 1.0 automatically selects a query method<a href="#how-dvc-10-automatically-selects-a-query-method" aria-label="how dvc 10 automatically selects a query method permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In DVC, the number of files we need to query is just the number of files for a
given project revision. So, as long as we can estimate the number of files in a
DVC remote, we can programmatically choose the optimal query method for a remote
operation.</p>
<p>In DVC 1.0, we accomplish this by taking advantage of the DVC remote structure.
The over/under remote size threshold only depends on the number of files being
queried (i.e. the number of files in our DVC versioned dataset). And as we have
already established, a DVC remote will be evenly distributed. Therefore, if we
know the number of files contained in a subset of the remote, we can then
estimate the number of files contained in the entire remote.</p>
<p>For example, if we know that the remote subdirectory <code>00/</code> contains 10 files, we
can estimate that the remote contains roughly <code>256 * 10 = 2,560</code> files in total.
So, by requesting a list of one subdirectory at a time (rather than the full
remote) via the cloud storage API, we can calculate a running estimate of the
total remote size. If the running estimated total size goes over the threshold
value, DVC will stop fetching the contents of the remote subdirectory, and
switch to querying each file in our dataset individually. If DVC reaches the end
of the subdirectory without the estimated size going over the threshold, it will
continue to fetch the full listing for the rest of the remote.</p>
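<p>A rough Python sketch of this running estimate is shown below. It is not DVC's actual implementation: the function name is hypothetical, the pages are passed in as plain lists instead of coming from a cloud API, and the threshold reuses the simplified <code>query size * page size</code> rule from the earlier API-call example.</p>

```python
from typing import Iterable, List

SUBDIRS = 256  # two hex characters: 00 through ff


def choose_query_method(
    query_size: int,
    subdir_pages: Iterable[List[str]],
    page_size: int = 1000,
) -> str:
    """Pick a query method from a paged listing of one remote subdirectory.

    The threshold is the simplified one from the API-call example earlier in
    the post, used here as an assumption and not as DVC's real heuristic.
    """
    threshold = query_size * page_size
    seen = 0
    for page in subdir_pages:
        seen += len(page)
        # Running estimate of the whole remote, extrapolated from one subdirectory.
        if seen * SUBDIRS > threshold:
            # Remote looks too large: stop listing, query files one by one.
            return "method 1 (query individual files)"
    # Subdirectory exhausted without crossing the threshold.
    return "method 2 (full remote listing)"


# 10 files in 00/ gives an estimate of 256 * 10 = 2,560 files in total.
print(choose_query_method(100, [["f"] * 10]))
# A large subdirectory crosses the threshold after its first page.
print(choose_query_method(100, [["f"] * 1000] * 5))
```

<p>Listing a single subdirectory keeps the cost of the estimate small, since in an evenly distributed remote it holds roughly 1/256th of the files.</p>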
<p>By estimating remote size in DVC 1.0, we can ensure that we always use the
optimal method when querying remote status.</p>
<h3 id="how-dvc-10-uses-indices-to-reduce-the-number-of-files-to-query" style="position:relative;">How DVC 1.0 uses indices to reduce the number of files to query<a href="#how-dvc-10-uses-indices-to-reduce-the-number-of-files-to-query" aria-label="how dvc 10 uses indices to reduce the number of files to query permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>A common DVC use case is
<a href="https://dvc.org/doc/use-cases/versioning-data-and-model-files" target="_blank" rel="nofollow noopener noreferrer">versioning</a> the
contents of a large directory. As the contents of the directory change over
time, DVC will be used to push each updated version of the directory into cloud
storage. In many cases, only a small number of files within that directory will
be modified between project iterations.</p>
<p>So after the first version of a project is pushed into cloud storage, for
subsequent versions, only the small subset of changed files actually needs to be
synchronized with cloud storage.</p>
<p>Consider a case where a user has an existing directory with 1 million files
which has been versioned and pushed to a remote with DVC. In the next iteration
of the project, only a single file in the directory has been modified. We can
obviously see that everything other than the one modified file will already
exist in cloud storage. Ideally, we should only need to query for the single
modified file.</p>
<p>However, in DVC releases prior to 1.0, DVC would always need to query for every
file in the directory, regardless of whether or not a given file had changed
since the last time it was pushed to remote storage.</p>
<p>But in DVC 1.0, we now keep an index of directories which have already been
versioned and pushed into remote storage. By referencing this index, DVC will
"remember" which files already exist in a remote, and will remove them from our
query set at the start of a data sync operation (before we choose a query
method, and before we make any cloud storage API requests).</p>
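<p>Conceptually, this step is a set difference. The sketch below illustrates the idea only: the function name and the in-memory sets are assumptions, and DVC's real index is not stored or queried this way.</p>

```python
from typing import Iterable, Set


def files_to_query(revision_hashes: Iterable[str], indexed_hashes: Set[str]) -> Set[str]:
    """Drop every hash the local index already knows to be in the remote."""
    return set(revision_hashes) - indexed_hashes


# 10,000 files were pushed earlier and recorded in the index.
indexed = {f"{i:032x}" for i in range(10_000)}

# In the next revision one file was modified, so its hash changed.
revision = set(indexed)
revision.discard(f"{0:032x}")
revision.add("f" * 32)

query_set = files_to_query(revision, indexed)
print(len(query_set))  # only the modified file is left to query
```

<p>Whichever query method is chosen afterwards, it now only has to deal with the one remaining hash.</p>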
<p><em>Note: This optimization only applies to DVC versioned directories. Individually
versioned files (including those added with <a href="https://dvc.org/doc/command-reference/add#-R"><code>dvc add -R</code></a>) are not indexed in DVC
1.0, and will always be queried during remote operations.</em></p>
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>By utilizing a storage structure that allows for optimized status queries, DVC
makes data synchronization incredibly fast. Coupled with the ability to quickly
identify which files remain unchanged between sync operations, DVC 1.0 is a
powerful data management tool.</p>
<p>Whether you are upgrading from a prior DVC release, or trying DVC for the first
time, we hope that all of our users are able to benefit from these new
optimizations. DVC performance is an important issue, and our team is looking
forward to working on further
<a href="https://github.com/iterative/dvc/labels/performance" target="_blank" rel="nofollow noopener noreferrer">performance optimizations</a>
in the future - across all areas in DVC, not just remotes.</p>
<p>As always, if you have any questions, comments or suggestions regarding DVC
performance, please feel free to connect with the DVC community on
<a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">Discourse</a>, <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord</a> and
<a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">GitHub</a>.</p>https://dvc.org/blog/november-20-community-gemshttps://dvc.org/blog/november-20-community-gemsWed, 25 Nov 2020 00:00:00 GMT<h2 id="dvc-questions" style="position:relative;">DVC questions<a href="#dvc-questions" aria-label="dvc questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="q-if-i-checkout-a-different-git-branch-how-do-i-synchronize-with-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/773498570795778058" target="_blank" rel="nofollow noopener noreferrer">Q: If I checkout a different Git branch, how do I synchronize with DVC?</a><a href="#q-if-i-checkout-a-different-git-branch-how-do-i-synchronize-with-dvc" aria-label="q if i checkout a different git branch how do i synchronize with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Here's what we recommend. When you check out a different Git branch in your
project:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> <span class="token parameter variable">-b</span> <span class="token operator"><</span>my_great_new_branch<span class="token operator">></span></span></code></pre></div>
<p>you'll want to next run</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc checkout</span></span></code></pre></div>
<p>to synchronize your workspace with the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files on that branch. But <em>did you know</em> you can
automate this with a <code>post-checkout</code> Git hook? We've got a hook that executes
<a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a> whenever you run <code>git checkout</code>, so you'll always have the
correct data file versions. Head to our docs to
<a href="https://dvc.org/doc/command-reference/install#install" target="_blank" rel="nofollow noopener noreferrer">read up on installing Git hooks into your DVC repository</a>
so you never forget to <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a>!</p>
<h3 id="q-i-have-a-big-100-gb-directory-i-want-to-know-where-the-contents-are-located-so-i-can-open-them-with-spark--is-there-a-way-to-get-the-location-of-my-files-without-caching-them-locally" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/771386223403073587" target="_blank" rel="nofollow noopener noreferrer">Q: I have a big, 100 GB directory. I want to know where the contents are located so I can open them with Spark- is there a way to get the location of my files without caching them locally?</a><a href="#q-i-have-a-big-100-gb-directory-i-want-to-know-where-the-contents-are-located-so-i-can-open-them-with-spark--is-there-a-way-to-get-the-location-of-my-files-without-caching-them-locally" aria-label="q i have a big 100 gb directory i want to know where the contents are located so i can open them with spark is there a way to get the location of my files without caching them locally permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>For this, we'd recommend the
<a href="https://dvc.org/doc/api-reference/get_url#dvcapiget_url" target="_blank" rel="nofollow noopener noreferrer">DVC Python API</a>'s
<code>get_url</code> function. For example, in a Python script you'd write:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> dvc<span class="token punctuation">.</span>api
resource_url <span class="token operator">=</span> dvc<span class="token punctuation">.</span>api<span class="token punctuation">.</span>get_url<span class="token punctuation">(</span>
    <span class="token string">"<top-level-directory>"</span><span class="token punctuation">,</span>
    repo<span class="token operator">=</span><span class="token string">"https://github.com/<your-repo>"</span><span class="token punctuation">,</span>
<span class="token punctuation">)</span></code></pre></div>
<p>This call returns the URL of a file that ends in <code>.dir</code>. The
<code>.dir</code> file contains a JSON-formatted table of the hashes and relative paths for
all the files inside <code><top-level-directory></code>. You could then parse that file to
get the relative paths to the files in your remote storage.</p>
<p>The JSON object will look something like this, for a file <code>foo/bar</code> in your
project:</p>
<div class="gatsby-highlight" data-language="json"><pre class="language-json"><code class="language-json"><span class="token punctuation">{</span> <span class="token property">"md5"</span><span class="token operator">:</span> <span class="token string">"abcd123"</span><span class="token punctuation">,</span> <span class="token property">"relpath"</span><span class="token operator">:</span> <span class="token string">"foo/bar"</span> <span class="token punctuation">}</span></code></pre></div>
<p>Then you can build the absolute location of <code>foo/bar</code> in remote storage from
its hash as follows:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">https://<path-to-your-remote-storage>/ab/cd123</code></pre></div>
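<p>Putting the two steps together, a minimal Python sketch might look like the following. It assumes the <code>.dir</code> file is a JSON array of objects like the one above; the helper name <code>remote_paths</code> and the base URL <code>https://example.com/my-remote</code> are placeholders for illustration, not part of the DVC API.</p>

```python
import json
from typing import Dict


def remote_paths(dir_json: str, remote_base: str) -> Dict[str, str]:
    """Map each relative path in a .dir listing to its location in the remote."""
    base = remote_base.rstrip("/")
    paths = {}
    for entry in json.loads(dir_json):
        md5 = entry["md5"]
        # The first two hash characters are the subdirectory, the rest the file name.
        paths[entry["relpath"]] = f"{base}/{md5[:2]}/{md5[2:]}"
    return paths


# Contents of a .dir file, assumed here to be a JSON array of entries.
dir_contents = '[{"md5": "abcd123", "relpath": "foo/bar"}]'
print(remote_paths(dir_contents, "https://example.com/my-remote"))
# {'foo/bar': 'https://example.com/my-remote/ab/cd123'}
```

<p>The resulting locations can then be handed to a tool such as Spark without pulling the data into the local cache first.</p>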
<p>To better understand how DVC uses
<a href="https://en.wikipedia.org/wiki/Content-addressable_storage" target="_blank" rel="nofollow noopener noreferrer">content-addressable storage</a>
in your remote,
<a href="https://dvc.org/doc/user-guide/dvc-internals#structure-of-the-cache-directory" target="_blank" rel="nofollow noopener noreferrer">read up in our docs</a>.</p>
<h3 id="q-can-i-have-more-than-one-dvcyaml-file-in-my-project" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/777946398250893333" target="_blank" rel="nofollow noopener noreferrer">Q: Can I have more than one <code>dvc.yaml</code> file in my project?</a><a href="#q-can-i-have-more-than-one-dvcyaml-file-in-my-project" aria-label="q can i have more than one dvcyaml file in my project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>By default, DVC pipelines record all your stages (and their inputs and outputs)
in a single file, <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>. Per directory, you can have one <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file.
If you want to run pipelines in a different folder than your project root, you
could create another <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> in a subdirectory.</p>
<p>However, <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> is intended to be the only file you need to record and
reproduce pipelines per directory. Pipelines are designed to have all stages
stored in the same place, and there's currently no method to rename <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>.</p>
<h3 id="q-how-can-i-untrack-a-file-thats-being-tracked-by-dvc-i-want-to-remove-it-from-remote-storage-and-my-local-cache-too" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/773277514717462548" target="_blank" rel="nofollow noopener noreferrer">Q: How can I untrack a file that's being tracked by DVC? I want to remove it from remote storage and my local cache, too.</a><a href="#q-how-can-i-untrack-a-file-thats-being-tracked-by-dvc-i-want-to-remove-it-from-remote-storage-and-my-local-cache-too" aria-label="q how can i untrack a file thats being tracked by dvc i want to remove it from remote storage and my local cache too permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If you want to untrack a file, perhaps something you added to DVC in error, you
can use <a href="https://dvc.org/doc/command-reference/remove"><code>dvc remove</code></a> to get rid of the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file corresponding to your file,
and then clear your DVC cache with <a href="https://dvc.org/doc/command-reference/gc#-w"><code>dvc gc -w --cloud</code></a>.
<a href="https://dvc.org/doc/user-guide/how-to/stop-tracking-data" target="_blank" rel="nofollow noopener noreferrer">Check out our docs</a>
to learn more about <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a> and what its flags mean (you'll want to be sure you
know what you're doing, since cache cleaning deletes files permanently!).</p>
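<p>As a rough sketch of that two-step cleanup, assuming a tracked file whose metafile is called <code>data.xml.dvc</code> (a hypothetical name used only for illustration) and a default remote that is already configured:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"># Stop tracking the file by removing its .dvc metafile (hypothetical file name)
dvc remove data.xml.dvc
# Delete cached data that the current workspace no longer references (-w),
# both locally and in the default remote (--cloud). This cannot be undone.
dvc gc -w --cloud</code></pre></div>
<p><code>dvc gc</code> asks for confirmation before deleting anything, which gives you one last chance to check the scope of the cleanup.</p>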
<p>Alternatively, you can manually find and delete your files:</p>
<ol>
<li>Find the file using its hash from the corresponding <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file (or, if it's
part of a pipeline, the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a> file).</li>
<li>Look in your remote storage and remove the file matching the hash.</li>
<li>Look in <code>.dvc/cache</code> and remove the file as well. If you'd like to better
understand how your cache is organized,
<a href="https://dvc.org/doc/user-guide/dvc-internals#structure-of-the-cache-directory" target="_blank" rel="nofollow noopener noreferrer">we have docs for that</a>.</li>
</ol>
<p>Your DVC remote storage and cache are simply storage locations, so once your
file is gone from there it's gone for good.</p>
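<p>To make the manual route more concrete, here is a minimal sketch of how a hash maps to a path in the cache. It assumes the layout described in the internals docs linked above (a directory named after the first two characters of the hash, then a file named after the rest); newer DVC releases may nest this under additional subdirectories, and the hash value below is a made-up placeholder:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"># Placeholder md5 value, as you would copy it from the "md5" field of a .dvc file
hash="a304afb96060aad90176268345e10355"
# The first two characters name the directory, the remainder names the file
prefix="$(printf '%s' "$hash" | cut -c1-2)"
rest="$(printf '%s' "$hash" | cut -c3-)"
cache_path=".dvc/cache/${prefix}/${rest}"
# Print the location to inspect (and delete by hand if you are sure)
echo "$cache_path"</code></pre></div>
<p>The same prefix-and-remainder naming is what you would look for in remote storage, relative to the remote's root.</p>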
<h3 id="q-my-dvc-cache-is-getting-a-bit-big-can-i-clean-it" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/771275051382341674" target="_blank" rel="nofollow noopener noreferrer">Q: My DVC cache is getting a bit big. Can I clean it?</a><a href="#q-my-dvc-cache-is-getting-a-bit-big-can-i-clean-it" aria-label="q my dvc cache is getting a bit big can i clean it permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Definitely. Have you seen the command <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a>? It helps you clean your local
cache (<a href="https://dvc.org/doc/command-reference/gc" target="_blank" rel="nofollow noopener noreferrer">read up here</a>). This command
lets you get granular about what you're keeping; for example, you can instruct
<a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a> to preserve cache files that are currently used in your local workspace,
at the tips of Git branches, in tagged Git commits or in all Git commits. Everything else
will be removed.</p>
<p>One word of caution: make sure that when you collect garbage from your cache,
you don't delete any files that you haven't yet pushed to a remote. If this
happens, you'll delete them permanently. To be safe, it never hurts to
<a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> your files of interest before cleaning.</p>
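<p>One way to put that advice into practice, sketched under the assumption that you want to keep everything referenced by your workspace, branch tips and tags:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"># Upload data referenced by all branches and tags before cleaning anything
dvc push --all-branches --all-tags
# Keep cache files used by the workspace, branch tips and tagged commits;
# everything else in the local cache is deleted
dvc gc --workspace --all-branches --all-tags</code></pre></div>
<p>Swap the scope flags for <code>--all-commits</code> if you'd rather preserve data from your entire Git history.</p>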
<h2 id="cml-questions" style="position:relative;">CML questions<a href="#cml-questions" aria-label="cml questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="q-does-cml-support-bitbucket" style="position:relative;"><a href="https://github.com/iterative/cml/issues/140" target="_blank" rel="nofollow noopener noreferrer">Q: Does CML support Bitbucket?</a><a href="#q-does-cml-support-bitbucket" aria-label="q does cml support bitbucket permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We've just rolled out Bitbucket Cloud support! There are brand new docs in the CML
project repo,
<a href="https://github.com/iterative/cml/wiki/CML-with-Bitbucket-Cloud" target="_blank" rel="nofollow noopener noreferrer">so check them out</a>
to get started. A few quick notes to keep in mind:</p>
<ol>
<li>
<p>Like GitLab, Bitbucket Cloud requires you to create a token for authorizing
CML to write comments. Make sure you don't forget this step (it's in the
docs!) or you'll surely hit a permissions error.</p>
</li>
<li>
<p>Bitbucket Cloud uses Bitbucket Pipelines for continuous integration
workflows, which
<a href="https://jira.atlassian.com/browse/BCLOUD-16995" target="_blank" rel="nofollow noopener noreferrer">currently doesn't support self-hosted runners</a>.
That means
<a href="https://community.atlassian.com/t5/Bitbucket-questions/Does-bitbucket-pipe-support-GPUs-yet/qaq-p/1042659" target="_blank" rel="nofollow noopener noreferrer">bringing your own GPUs is not supported</a>.
Sorry! But you can still have all the other CML benefits of plots, tables and
text in your Pull Request.</p>
</li>
<li>
<p>Bitbucket Server support (with Jenkins and Bamboo) is under active
development. Stay tuned!</p>
</li>
</ol>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ae915fa598568bd5e8ca33c3922d398d/39600/bitbucket_cloud_pr.png" alt="bitbucket cloud pr" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Now your Bitbucket PRs
can be as pretty as you.</em></p>
<h3 id="q-can-i-use-cml-with-windows-runners" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/772519007894765600" target="_blank" rel="nofollow noopener noreferrer">Q: Can I use CML with Windows runners?</a><a href="#q-can-i-use-cml-with-windows-runners" aria-label="q can i use cml with windows runners permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>While all our CML tutorials and docs use Ubuntu runners of various flavors,
there's no problem with using Windows runners. Both
<a href="https://docs.github.com/en/free-pro-team@latest/actions/reference/specifications-for-github-hosted-runners" target="_blank" rel="nofollow noopener noreferrer">GitHub Actions</a>
and
<a href="https://about.gitlab.com/blog/2020/01/21/windows-shared-runner-beta/" target="_blank" rel="nofollow noopener noreferrer">GitLab CI</a>
have Windows runners up for grabs. And of course, you can set up your own
Windows machine as a self-hosted runner (see the self-hosted runner docs for
your CI system to learn more).</p>
<p>What if you have a GPU? If you want to use
<a href="https://dvc.org/blog/cml-self-hosted-runners-on-demand-with-gpus" target="_blank" rel="nofollow noopener noreferrer"><code>nvidia-docker</code> to put GPU drivers in your container</a>,
you'll want to use <code>nvidia-docker</code> with the Windows Subsystem for Linux (WSL).
That means you'll first install an Ubuntu subsystem on your Windows machine,
then all your Nvidia drivers, then Docker and <code>nvidia-docker</code>. Check out some
<a href="https://docs.nvidia.com/cuda/wsl-user-guide/index.html" target="_blank" rel="nofollow noopener noreferrer">more docs about CUDA with WSL</a>
to learn more.</p>
<h3 id="q-im-using-cml-to-deploy-a-self-hosted-runner-with-gitlab-i-noticed-that-in-your-docs-the-runner-is-always-set-to-timeout-after-1800-seconds-and-then-it-gets-unregistered-from-gitlab-what-if-i-want-to-keep-my-runner-registered-after-the-job-ends" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/779317571354099722" target="_blank" rel="nofollow noopener noreferrer">Q: I'm using CML to deploy a self-hosted runner with GitLab. I noticed that in your docs, the runner is always set to timeout after 1800 seconds, and then it gets unregistered from GitLab. What if I want to keep my runner registered after the job ends?</a><a href="#q-im-using-cml-to-deploy-a-self-hosted-runner-with-gitlab-i-noticed-that-in-your-docs-the-runner-is-always-set-to-timeout-after-1800-seconds-and-then-it-gets-unregistered-from-gitlab-what-if-i-want-to-keep-my-runner-registered-after-the-job-ends" aria-label="q im using cml to deploy a self hosted runner with gitlab i noticed that in your docs the runner is always set to timeout after 1800 seconds and then it gets unregistered from gitlab what if i want to keep my runner registered after the job ends permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>With CML, we introduced an approach using Docker Machine to provision instances
in the cloud, and then use <code>docker run</code> to register them as self-hosted runners to
complete your workflow. As this question points out, we like to set runners to
time out after 1800 seconds, which is why you'll see this code in our
<a href="https://github.com/iterative/cml_cloud_case/blob/master/.github/workflows/cml.yaml" target="_blank" rel="nofollow noopener noreferrer">sample "Cloud GPU" workflow</a>:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">sudo</span> <span class="token function">docker</span> run <span class="token parameter variable">--name</span> myrunner <span class="token parameter variable">-d</span> <span class="token parameter variable">--gpus</span> all <span class="token punctuation">\</span>
<span class="token parameter variable">-e</span> <span class="token assign-left variable">RUNNER_IDLE_TIMEOUT</span><span class="token operator">=</span><span class="token number">1800</span> <span class="token punctuation">\</span>
<span class="token parameter variable">-e</span> <span class="token assign-left variable">RUNNER_LABELS</span><span class="token operator">=</span>cml,gpu <span class="token punctuation">\</span>
    <span class="token parameter variable">-e</span> <span class="token assign-left variable">RUNNER_REPO</span><span class="token operator">=</span><span class="token variable">$CI_SERVER_URL</span> <span class="token punctuation">\</span>
<span class="token parameter variable">-e</span> <span class="token assign-left variable">repo_token</span><span class="token operator">=</span><span class="token variable">$REGISTRATION_TOKEN</span> <span class="token punctuation">\</span>
<span class="token parameter variable">-e</span> <span class="token assign-left variable">RUNNER_DRIVER</span><span class="token operator">=</span>gitlab <span class="token punctuation">\</span>
iterativeai/cml:0-dvc2-base1-gpu runner</span></code></pre></div>
<p>We did this so you'll avoid running up GPU hours and a big bill. If you're not
worried about that, though, you can set the environment variable
<code>RUNNER_IDLE_TIMEOUT</code> in the <code>iterativeai/cml</code> container to 0. Then, your self-hosted
runner will stay on forever, or at least until you manually turn it off.</p>
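<p>For illustration, here is the same command from the sample workflow with only the timeout changed. Treat it as a sketch: it assumes the same image tag, container name and CI variables shown above.</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"># Same runner registration as above, but an idle timeout of 0 keeps it alive
sudo docker run --name myrunner -d --gpus all \
  -e RUNNER_IDLE_TIMEOUT=0 \
  -e RUNNER_LABELS=cml,gpu \
  -e RUNNER_REPO=$CI_SERVER_URL \
  -e repo_token=$REGISTRATION_TOKEN \
  -e RUNNER_DRIVER=gitlab \
  iterativeai/cml:0-dvc2-base1-gpu runner

# Later, stop the container by hand once you no longer need the runner
sudo docker stop myrunner</code></pre></div>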
<p>By the way… stay tuned for a big update here. We're currently replacing the
Docker Machine approach with a method based on Terraform, and we can't wait to
unveil it. It should make deploying cloud instances on AWS, GCP and Azure work
with less code than ever.</p>
<h3 id="q-what-did-deevee-do-for-thanksgiving" style="position:relative;">Q: What did DeeVee do for Thanksgiving?<a href="#q-what-did-deevee-do-for-thanksgiving" aria-label="q what did deevee do for thanksgiving permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>She stayed home and made mashed potatoes.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/252ccf20ce3c7a53778c4d2a07c2a99e/39600/deevee_n_taters.png" alt="deevee n taters" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>That's all for now, everyone! As always, keep in touch with all your questions
big and small.</p>https://dvc.org/blog/november-20-dvc-heartbeathttps://dvc.org/blog/november-20-dvc-heartbeatWed, 11 Nov 2020 00:00:00 GMT<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Welcome to the November Heartbeat! Let's dive in with some news from the team.</p>
<h3 id="datacouncil-interviews-dmitry" style="position:relative;">DataCouncil interviews Dmitry<a href="#datacouncil-interviews-dmitry" aria-label="datacouncil interviews dmitry permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://twitter.com/DataCouncilAI" target="_blank" rel="nofollow noopener noreferrer">Data Council</a>'s
<a href="https://twitter.com/petesoder?lang=en" target="_blank" rel="nofollow noopener noreferrer">Peter Soderling</a> interviewed CEO Dmitry!
Check out the recording from Data Council's live event, including Q&A from the
Data Council community, on YouTube.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/8dBCgIa7TGE?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></div>
<h3 id="were-hiring" style="position:relative;">We're hiring<a href="#were-hiring" aria-label="were hiring permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Did you know we're hiring for two roles in our growing team? We're looking for:</p>
<ul>
<li>
<p>A
<a href="https://weworkremotely.com/remote-jobs/iterative-senior-software-engineer-open-source-dev-tools-3" target="_blank" rel="nofollow noopener noreferrer"><strong>Senior Software Engineer</strong></a>
for the core DVC team: someone with strong Python development skills who can
build and ship essential DVC features.</p>
</li>
<li>
<p>A
<a href="https://weworkremotely.com/remote-jobs/iterative-developer-advocate" target="_blank" rel="nofollow noopener noreferrer"><strong>Developer Advocate</strong></a>
to lead the community, support contributors and new users, and create new
content like blogs and videos about DVC and CML.</p>
</li>
</ul>
<p>Here are a few reasons to consider joining us:</p>
<ul>
<li>Your work will be visible and will be used by thousands of developers every day!</li>
<li>We're a small, fully remote team. Work from anywhere!</li>
<li>Competitive salary and benefits</li>
<li>Family-friendly benefits, including unlimited PTO</li>
</ul>
<p>If you're interested, we'd love to hear from you about either role (and we
welcome referrals if you know a good candidate)!</p>
<h3 id="new-videos" style="position:relative;">New videos<a href="#new-videos" aria-label="new videos permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We're continuing to develop our video docs, and now half of our "Getting
Started" section has video accompaniments. Check out our latest release on
<a href="https://dvc.org/doc/start/data-and-model-access" target="_blank" rel="nofollow noopener noreferrer">data access with DVC</a>:</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/EE7Gk84OZY8?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></div>
<p>This video covers functions like <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a>, <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a>, and the DVC Python
API.</p>
<p>We took a quick break from releasing videos during the US election week, but
look out for a new video on our
<a href="https://www.youtube.com/channel/UC37rp97Go-xIX3aNFVHhXfQ" target="_blank" rel="nofollow noopener noreferrer">YouTube channel</a>
about model testing with continuous integration! Subscribe to get alerts
whenever we have something new :)</p>
<h3 id="workshops-and-conferences" style="position:relative;">Workshops and conferences<a href="#workshops-and-conferences" aria-label="workshops and conferences permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>As usual, there are plenty of remote meetings on our schedules:</p>
<ul>
<li>
<p><a href="http://www.bootcamp.dadosesaude.com/" target="_blank" rel="nofollow noopener noreferrer">HealthData Bootcamp</a> is a weeklong
intensive for all things biomedical data science. Dmitry and I (Elle),
plus DVC Ambassadors Mikhail Rozhkov and Marcel Ribeiro-Dantas, will be
presenting lectures and workshops about MLOps throughout the week!</p>
</li>
<li>
<p>I'll be leading a hands-on workshop at the
<a href="https://torontomachinelearning.com/" target="_blank" rel="nofollow noopener noreferrer">Toronto Machine Learning Society Annual Meeting</a>.
It'll cover how to get started using
<a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">Continuous Machine Learning</a> (CML) with GitHub Actions.
<a href="https://torontomachinelearning.com/" target="_blank" rel="nofollow noopener noreferrer">Register here</a>, and be sure to reserve
your spot in the workshop.</p>
</li>
<li>
<p>This week, I have another talk at <a href="https://global.pydata.org/" target="_blank" rel="nofollow noopener noreferrer">PyData Global</a>
about CML. PyData Global is online for the first time ever and promises to be
a great gathering for Python-using data scientists in industry and academic
research alike.</p>
</li>
</ul>
<h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Here are some of our favorite happenings around the MLOps community this week.</p>
<h3 id="a-new-online-course" style="position:relative;">A new online course<a href="#a-new-online-course" aria-label="a new online course permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://twitter.com/GokuMohandas" target="_blank" rel="nofollow noopener noreferrer">Goku Mohandas</a>, founder of
<a href="https://twitter.com/madewithml" target="_blank" rel="nofollow noopener noreferrer">Made with ML</a>, announced plans to release a new
online course about putting ML in production. The curriculum will cover
everything from experiment tracking to deploying and monitoring models in
production, and you can expect DVC to be included! Keep an eye on Goku and Made
with ML on Twitter for updates.</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">🔥 Putting ML in Production! We're going to publicly develop <a href="https://twitter.com/MadeWithML">@madewithml</a>'s first ML service. Here is the broad curriculum: <br><br>- 📦 Product<br>- 🔢 Data<br>- 🤖 Modeling<br>- 📝 Scripting<br>- 🛠 API<br>- 🚀 Production<br><br>More details (lessons, task, etc.) here: <a href="https://t.co/xmMm9XGK9j">https://t.co/xmMm9XGK9j</a><br><br>Thread 👇 <a href="https://t.co/T0uLPb2QbR">pic.twitter.com/T0uLPb2QbR</a></p>— Goku Mohandas (@GokuMohandas) <a href="https://twitter.com/GokuMohandas/status/1315990996849627136">October 13, 2020</a></blockquote>
<h3 id="our-favorite-blogs" style="position:relative;">Our favorite blogs<a href="#our-favorite-blogs" aria-label="our favorite blogs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://twitter.com/visenger" target="_blank" rel="nofollow noopener noreferrer">Dr. Larysa Visengeriyeva</a>, creator of the
top-notch
<a href="https://github.com/visenger/awesome-mlops" target="_blank" rel="nofollow noopener noreferrer">"Awesome MLOps" GitHub repo</a>, and
DevOps expert Anja Kammer wrote a must-read essay about CI/CD for ML (note: it's
published in German; I used Chrome's built-in translation to read it in English).</p>
<p>The blog covers key concepts like continuous integration, deployment, and
training with ML, as well as practical approaches and sample architectures.</p>
<section class="elp-content-holder">
<a href="https://www.innoq.com/de/articles/2020/10/mlops-operations-fuer-machine-learning/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">MLOps: You Train It, You Run It!</h4>
<div class="elp-description">CI / CD & Operations for machine learning</div>
<div class="elp-link">innoq.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-11-08/innoq-35328b26bf404a8d5892cea6cae83fb3.png" alt="MLOps: You Train It, You Run It!">
</div>
</a>
</section>
<p><em>Also</em>, there's some cool art.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b38952332c3d2bbd69dbeb9cf47fa685/39600/mlops_diagram.png" alt="mlops diagram" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Another blog on our radar: <a href="https://twitter.com/lopp_sean" target="_blank" rel="nofollow noopener noreferrer">Sean Lopp</a> at
<a href="https://twitter.com/rstudio" target="_blank" rel="nofollow noopener noreferrer">RStudio</a> wrote the first known blog post about a CML
report with a ggplot! Using RStudio's
<a href="https://github.com/r-lib/actions" target="_blank" rel="nofollow noopener noreferrer">GitHub Actions for R</a> and CML, Sean built a
sample data science workflow that runs automatically in GitHub Actions on a
push. He reports on some pros, cons, and areas for future development to make R
language data science easy to automate.</p>
<section class="elp-content-holder">
<a href="https://loppsided.blog/posts/2020-10-26-tidymodels-dvc-mashup/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Tidymodels DVC Mashup</h4>
<div class="elp-description">Using Github Actions and Data Version Control for ModelOps in R</div>
<div class="elp-link">loppsided.blog</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-11-08/sean_lopp-6c26e81b394a7c61ebab8e0dc7f00e56.jpg" alt="Tidymodels DVC Mashup">
</div>
</a>
</section>
<p>Finally, developer <a href="https://twitter.com/stribny" target="_blank" rel="nofollow noopener noreferrer">Petr Stribny</a> wrote about how
to version big files in a Git project with DVC. It's a short-and-sweet guide to
getting started, and if you're trying to decide if DVC is for you, this is worth
a look.</p>
<section class="elp-content-holder">
<a href="https://stribny.name/blog/2020/10/versioning-large-files-in-git-with-dvc/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Versioning large files in git with DVC</h4>
<div class="elp-description">Software development and beyond</div>
<div class="elp-link">stribny.name</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-11-08/petr-f92d049b5032e322835f05e74cc215f7.jpg" alt="Versioning large files in git with DVC">
</div>
</a>
</section>
<h3 id="a-nice-tweet" style="position:relative;">A nice tweet<a href="#a-nice-tweet" aria-label="a nice tweet permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>To wrap it up, here's a kind tweet that we really like. It's always good to be
mentioned in the same tweet as some of our heroes :)</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Companies such as <a href="https://twitter.com/astronomerio">@astronomerio</a>, <a href="https://twitter.com/HashiCorp">@HashiCorp</a>, <a href="https://twitter.com/supabase_io">@supabase_io</a>, <a href="https://twitter.com/Iterativeai">@Iterativeai</a> are excellent examples of companies with a relentless focus on building for developer love.</p>— Ethan Batraski (@ethanjb) <a href="https://twitter.com/ethanjb/status/1316833012676354048">October 15, 2020</a></blockquote>
<p>Thanks for reading this month!</p>https://dvc.org/blog/october-20-community-gemshttps://dvc.org/blog/october-20-community-gemsMon, 26 Oct 2020 00:00:00 GMT<h2 id="dvc-questions" style="position:relative;">DVC questions<a href="#dvc-questions" aria-label="dvc questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="q-whats-in-a-dvc-file-and-what-would-happen-if-decided-not-push-my-dvc-files-to-my-git-repo" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/760920403064520755" target="_blank" rel="nofollow noopener noreferrer">Q: What's in a <code>.dvc</code> file, and what would happen if I decided not to push my <code>.dvc</code> files to my Git repo?</a><a href="#q-whats-in-a-dvc-file-and-what-would-happen-if-decided-not-push-my-dvc-files-to-my-git-repo" aria-label="q whats in a dvc file and what would happen if decided not push my dvc files to my git repo permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC creates lightweight metafiles (<a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files) that correspond to large
artifacts in your project. These <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files contain pointers to your artifacts
in remote storage (we use a simple content-based storage scheme). Because we use
content-based storage, the remote storage itself isn't designed for browsing
(although
<a href="https://github.com/iterative/dvc/issues/3621" target="_blank" rel="nofollow noopener noreferrer">there are some discussions</a> about
how to make stored files more "discoverable", and you can always identify them
manually by their contents and meta-information like timestamps).</p>
<p>Your <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files establish meaningful links between human-readable
filenames and file contents in remote storage, and they let you use Git versioning
on your stored datasets and models. You can think of your DVC remote storage as
a <em>complement</em> to your Git repository, not a replacement.</p>
<p>In other words… if you're not Git versioning your <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files, you're not
versioning anything in DVC remote storage!</p>
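<p>To make the idea of content-based storage concrete, here is a minimal Python sketch. It assumes an MD5 hash and a two-character prefix directory, which is how the DVC cache was laid out around the time of this post; the exact layout can differ between DVC versions, and the helper names <code>storage_path</code> and <code>pointer_for</code> are hypothetical, not part of DVC.</p>

```python
import hashlib


def storage_path(content: bytes) -> str:
    """Return a content-addressed path: the location depends only on the bytes."""
    digest = hashlib.md5(content).hexdigest()
    # Assumed layout: first two hex characters form a directory, the rest the file name.
    return f"{digest[:2]}/{digest[2:]}"


def pointer_for(filename: str, content: bytes) -> dict:
    """Build a simplified stand-in for what a .dvc file records: hash plus readable path."""
    return {"outs": [{"md5": hashlib.md5(content).hexdigest(), "path": filename}]}


if __name__ == "__main__":
    data = b"label,value\ncat,1\n"
    # The same bytes map to the same storage location, whatever the file is called.
    print(storage_path(data))
    print(pointer_for("data/train.csv", data))
```

<p>The storage path carries no filename at all, which is why the remote is hard to browse and why the Git-versioned pointer is what ties a readable name to the stored bytes.</p>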
<h3 id="q-can-i-limit-the-number-of-network-connections-used-by-dvc-during-dvc-pull" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/739760523293360182" target="_blank" rel="nofollow noopener noreferrer">Q: Can I limit the number of network connections used by DVC during <code>dvc pull</code>?</a><a href="#q-can-i-limit-the-number-of-network-connections-used-by-dvc-during-dvc-pull" aria-label="q can i limit the number of network connections used by dvc during dvc pull permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yep! By default, DVC data transfer operations use a number of threads
proportional to the number of CPUs detected. But there's a handy flag for
<a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> and <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> that lets you override the default:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">-j <number>, --jobs <number> - number of threads to run
simultaneously to handle the downloading of files from
the remote. The default value is 4 * cpu_count(). For
SSH remotes, the default is just 4. Using more jobs may
improve the total download speed if a combination of small
and large files are being fetched.</code></pre></div>
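<p>The default described in that help text is simple arithmetic. The Python sketch below only illustrates the rule as quoted (4 for SSH remotes, otherwise 4 times the CPU count); <code>default_jobs</code> is a hypothetical helper, not DVC's actual implementation.</p>

```python
import os


def default_jobs(remote_scheme, cpu_count=None):
    """Illustrate the quoted rule: 4 for SSH remotes, else 4 * CPU count."""
    if remote_scheme == "ssh":
        return 4
    # os.cpu_count() can return None, so fall back to 1 in that case.
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return 4 * cpus


if __name__ == "__main__":
    print(default_jobs("s3"))   # depends on the machine
    print(default_jobs("ssh"))  # fixed default for SSH remotes
```

<p>On an 8-core machine this rule gives 32 threads for a non-SSH remote, so passing something like <code>-j 4</code> is a reasonable way to cap the number of connections.</p>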
<h3 id="q-im-working-on-a-multi-class-classification-task-can-dvc-plots-show-multiple-precision-recall-curves--one-for-each-class" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/765117500530491472" target="_blank" rel="nofollow noopener noreferrer">Q: I'm working on a multi-class classification task. Can <code>dvc plots</code> show multiple precision recall curves- one for each class?</a><a href="#q-im-working-on-a-multi-class-classification-task-can-dvc-plots-show-multiple-precision-recall-curves--one-for-each-class" aria-label="q im working on a multi class classification task can dvc plots show multiple precision recall curves one for each class permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Currently, <a href="https://dvc.org/doc/command-reference/plots"><code>dvc plots</code></a> doesn't support multiple linear curves on a single plot
(except for <a href="https://dvc.org/doc/command-reference/plots/diff"><code>dvc plots diff</code></a>, of course!). But you could make one
precision-recall curve per class and display them side-by-side.</p>
<p>To do this, you'd want to write the precision-recall curve values to separate
files for each class (<code>prc-0.json</code>, <code>prc-1.json</code>, etc.). Then you would run:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots show</span> prc-0.json prc-1.json</span></code></pre></div>
<p>And you'll see two plots side-by-side! A benefit of this approach is that when
you run <a href="https://dvc.org/doc/command-reference/plots/diff"><code>dvc plots diff</code></a> to compare precision recall curves across Git commits,
you'll get a comparison plotted for each class.</p>
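<p>As a rough sketch of how the per-class files might be produced, the Python below computes one-vs-rest precision and recall at a few thresholds and writes one JSON file per class. The function names, the threshold values, and the <code>prc</code> key wrapping the list of points are all assumptions for illustration; adapt the structure to whatever your plots setup expects.</p>

```python
import json
from pathlib import Path


def precision_recall_points(labels, scores, thresholds):
    """Compute precision and recall for binary labels at each threshold."""
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(labels, scores) if s < t and y == 1)
        # Convention chosen here: precision is 1.0 when nothing is predicted positive.
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append({"threshold": t, "precision": precision, "recall": recall})
    return points


def write_prc_files(y_true, class_scores, out_dir=".", thresholds=(0.25, 0.5, 0.75)):
    """Write prc-<class>.json for each class, treating that class as the positive label."""
    paths = []
    for cls, scores in enumerate(class_scores):
        labels = [1 if y == cls else 0 for y in y_true]
        path = Path(out_dir) / f"prc-{cls}.json"
        points = precision_recall_points(labels, scores, thresholds)
        path.write_text(json.dumps({"prc": points}, indent=2))
        paths.append(path)
    return paths


if __name__ == "__main__":
    y_true = [0, 1, 0, 1]
    # class_scores[c][i] is the model's score for class c on sample i (made-up values).
    class_scores = [[0.9, 0.2, 0.8, 0.4], [0.1, 0.8, 0.2, 0.6]]
    for p in write_prc_files(y_true, class_scores):
        print("wrote", p)
```

<p>The resulting <code>prc-0.json</code> and <code>prc-1.json</code> are the files you would pass to <code>dvc plots show</code>.</p>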
<h3 id="q-are-you-sure-i-should-commit-my-dvcconfig-file-it-contains-my-logging-credentials-for-storage-and-im-nervous-about-adding-it-to-a-shared-git-repository" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/768770079596740650" target="_blank" rel="nofollow noopener noreferrer">Q: Are you sure I should commit my <code>.dvc/config</code> file? It contains my login credentials for storage, and I'm nervous about adding it to a shared Git repository.</a><a href="#q-are-you-sure-i-should-commit-my-dvcconfig-file-it-contains-my-logging-credentials-for-storage-and-im-nervous-about-adding-it-to-a-shared-git-repository" aria-label="q are you sure i should commit my dvcconfig file it contains my logging credentials for storage and im nervous about adding it to a shared git repository permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is a common scenario: you don't necessarily want to broadcast your remote
storage credentials to everyone on your team, but you still want to check in
your DVC setup (meaning your <code>.dvc/config</code> file). In this case, you want to use
a <code>local</code> config file!</p>
<p>You can use the command</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc config</span> <span class="token parameter variable">--local</span></span></code></pre></div>
<p>to set up remote credentials that will be stored in <code>.dvc/config.local</code>. By
default, this file is Git-ignored, so you don't have to worry about
accidentally committing secrets to your Git repository.
<a href="https://dvc.org/doc/command-reference/config" target="_blank" rel="nofollow noopener noreferrer">Check out the docs</a> for more,
including the <code>--system</code> and <code>--global</code> options, which set your configuration
for all users on a machine and for all of your own projects, respectively.</p>
<h2 id="cml-questions" style="position:relative;">CML Questions<a href="#cml-questions" aria-label="cml questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="q-whats-the-file-size-limit-for-publishing-files-with-cml-publish" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/751001285100306502" target="_blank" rel="nofollow noopener noreferrer">Q: What's the file size limit for publishing files with <code>cml publish</code>?</a><a href="#q-whats-the-file-size-limit-for-publishing-files-with-cml-publish" aria-label="q whats the file size limit for publishing files with cml publish permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><code>cml publish</code> is a service for hosting files that are embedded in CML reports,
like images, audio files, and GIFs. By default, we have a limit of 2 MB per
upload.</p>
<p>If your files are larger than this (which can happen, depending on the machine
learning problem you're working on!) we recommend using GitLab's artifact
storage.
<a href="https://github.com/iterative/cml/issues/232" target="_blank" rel="nofollow noopener noreferrer">Based on discussions in the community</a>,
we recently implemented a CML flag (<code>--gitlab-uploads</code>) to streamline the
process:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cml</span> publish movie.mov <span class="token parameter variable">--md</span> <span class="token parameter variable">--gitlab-uploads</span> <span class="token operator">></span> report.md</span></code></pre></div>
<p>Note that we don't currently have an analogous solution for GitHub, because
GitHub artifacts expire after 90 days (whereas they're permanent in GitLab).</p>
<h3 id="q-im-getting-a-mysterious-error-message-failed-guessing-mime-type-of-file-when-i-try-to-use-cml-publish-whats-going-on" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/763840404675756042" target="_blank" rel="nofollow noopener noreferrer">Q: I'm getting a mysterious error message, <code>Failed guessing mime type of file</code>, when I try to use <code>cml publish</code>. What's going on?</a><a href="#q-im-getting-a-mysterious-error-message-failed-guessing-mime-type-of-file-when-i-try-to-use-cml-publish-whats-going-on" aria-label="q im getting a mysterious error message failed guessing mime type of file when i try to use cml publish whats going on permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This error message usually means that the target of <code>cml publish</code>, as in</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cml</span> publish <span class="token operator"><</span>target file<span class="token operator">></span></span></code></pre></div>
<p>is not found. Check for typos in the target filename and ensure that the file
was in fact generated during the run (if it isn't part of your Git repository).
We've <a href="https://github.com/iterative/cml/issues/308" target="_blank" rel="nofollow noopener noreferrer">opened an issue</a> to add a
more informative error message in the future.</p>
<h3 id="q-in-my-github-actions-workflow-i-use-dvc-metrics-diff-to-compare-metrics-generated-during-the-run-to-metrics-on-the-main-branch-and-print-a-table--but-the-table-isnt-showing-any-of-the-metrics-from-main-what-could-be-happening" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/768815157034876929" target="_blank" rel="nofollow noopener noreferrer">Q: In my GitHub Actions workflow, I use <code>dvc metrics diff</code> to compare metrics generated during the run to metrics on the main branch and print a table- but the table isn't showing any of the metrics from <code>main</code>. What could be happening?</a><a href="#q-in-my-github-actions-workflow-i-use-dvc-metrics-diff-to-compare-metrics-generated-during-the-run-to-metrics-on-the-main-branch-and-print-a-table--but-the-table-isnt-showing-any-of-the-metrics-from-main-what-could-be-happening" aria-label="q in my github actions workflow i use dvc metrics diff to compare metrics generated during the run to metrics on the main branch and print a table but the table isnt showing any of the metrics from main what could be happening permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>When a continuous integration runner won't report metrics from previous versions
of your project (or other branches), that's usually a sign that the runner
doesn't have access to the full Git history of your project or your metrics
themselves. Here are a few things to check for:</p>
<ol>
<li><strong>Did you fetch your Git working tree in the runner?</strong> Commands like
<a href="https://dvc.org/doc/command-reference/metrics/diff"><code>dvc metrics diff</code></a> require the Git history to be accessible, so make sure
that your workflow runs <code>git fetch</code> before this command. We
recommend:</li>
</ol>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git fetch</span> <span class="token parameter variable">--prune</span> <span class="token parameter variable">--unshallow</span></span></code></pre></div>
<ol start="2">
<li>
<p><strong>Are your metrics in your DVC remote?</strong> If your metrics are <em>cached</em> (which
they are by default when you create a DVC pipeline), your DVC remote should
be accessible to your runner. That means you need to add any credentials as
repository secrets (or variables, in GitLab), and do <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> in your
workflow before attempting <a href="https://dvc.org/doc/command-reference/metrics/diff"><code>dvc metrics diff</code></a>.</p>
</li>
<li>
<p><strong>Are your metrics in your local workspace?</strong> If you are <em>not</em> using a DVC
remote, your metric files must be <em>uncached</em> and committed to your Git
repository. To explore an example, say you have a pipeline stage that creates
<code>metric.json</code>:</p>
</li>
</ol>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-n</span> mystage <span class="token parameter variable">-m</span> metric.json python train.py</span></code></pre></div>
<p>By default, <code>metric.json</code> is cached and ignored by Git, which means that if you
aren't using a DVC remote in your CI workflow, <code>metric.json</code> will effectively be
abandoned on your local machine! You can avoid this by using the <code>-M</code> flag
instead of <code>-m</code> in <code>dvc run</code>, or by manually adding the field <code>cache: false</code> to
your metric in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>. Be sure to remove your metrics from any <code>.gitignore</code>
files, and commit and push them to your Git repository.</p>
<p>That's all for this month- Happy Halloween! Watch out for scary bugs. 🐛</p>https://dvc.org/blog/october-20-dvc-heartbeathttps://dvc.org/blog/october-20-dvc-heartbeatMon, 12 Oct 2020 00:00:00 GMT<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="paweł-gets-ready-to-speak-at-polands-largest-data-science-meeting" style="position:relative;">Paweł gets ready to speak at Poland's largest data science meeting<a href="#pawe%C5%82-gets-ready-to-speak-at-polands-largest-data-science-meeting" aria-label="paweł gets ready to speak at polands largest data science meeting permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC developer Paweł Redzyński (he's written a lot of the code behind
<a href="https://dvc.org/doc/command-reference/plots"><code>dvc plots</code></a>) is giving a talk at the <a href="https://dssconf.pl/" target="_blank" rel="nofollow noopener noreferrer">Data Science Summit</a>
in Poland! The virtual meeting is on October 16, but talks are available for
streaming on demand up to a week before. Paweł's talk is part of the DataOps &
Development track, where he'll be sharing about CML and GitHub Actions (note
that it'll be delivered in English).</p>
<p><a href="https://dssconf.pl" target="_blank" rel="nofollow noopener noreferrer"><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4af5a02cc92cd39e8cc7a546e1cbada8/39600/dss.png" alt="dss" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></a></p>
<h3 id="dmitry-talks-at-data-engineering-melbourne" style="position:relative;">Dmitry talks at Data Engineering Melbourne<a href="#dmitry-talks-at-data-engineering-melbourne" aria-label="dmitry talks at data engineering melbourne permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>CEO
<a href="https://www.meetup.com/Data-Engineering-Melbourne/events/267033998/" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov dropped into the Data Engineering Melbourne meetup</a>
to talk about Data Versioning and DataOps! He spoke about the differences
between end-to-end platforms and ecosystems of tools, and how this distinction
informs the development of software like DVC and CML (hint: we picked tools over
platforms).</p>
<p>Keep an eye on this meetup, which is now accessible to folks on all continents
thanks to the magic of the internet :)</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/Data-Engineering-Melbourne/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Data Engineering Melbourne</h4>
<div class="elp-description">Dmitry Petrov presents on DataOps and versioning.</div>
<div class="elp-link">meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-10-12/Meetup_Logo-04501e404e41367b16280cd0515d54df.png" alt="Data Engineering Melbourne">
</div>
</a>
</section>
<p></p>
<h3 id="elle-has-talks-at-pycon-india-and-pydata-global" style="position:relative;">Elle has talks at PyCon India and PyData Global<a href="#elle-has-talks-at-pycon-india-and-pydata-global" aria-label="elle has talks at pycon india and pydata global permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Last week I gave a talk about CML at
<a href="https://in.pycon.org/cfp/2020/proposals/how-to-make-continuous-integration-work-with-machine-learning~avK5b/" target="_blank" rel="nofollow noopener noreferrer">PyCon India</a>,
and have another one coming up at
<a href="https://global.pydata.org/talks/321" target="_blank" rel="nofollow noopener noreferrer">PyData Global</a> this November 11-15.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://global.pydata.org/talks/321" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">DevOps for science: using continuous integration for rigorous and reproducible analysis</h4>
<div class="elp-description">PyData Global</div>
<div class="elp-link">https://global.pydata.org</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-10-12/pydata-4857264047a11851293de84b3c988b3d.png" alt="DevOps for science: using continuous integration for rigorous and reproducible analysis">
</div>
</a>
</section>
<p></p>
<p>PyData Global has a fantastic lineup of talks spanning science and engineering,
so please consider joining!</p>
<h3 id="dvc-at-datafest" style="position:relative;">DVC at DataFest<a href="#dvc-at-datafest" aria-label="dvc at datafest permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC Ambassador Mikhail Rozhkov co-hosted the Machine Learning REPA
(Reproducibility, Experiments and Pipelines Automation) track of
<a href="https://datafest.ru/" target="_blank" rel="nofollow noopener noreferrer">DataFest 2020</a>, and DVC showed up in full force! There
were talks from Dmitry, ambassador Marcel Ribeiro-Dantas, and myself about all
aspects of MLOps and automation.</p>
<p>DataFest is over (until next year, anyway), but
<a href="http://ml-repa.ru/en#about" target="_blank" rel="nofollow noopener noreferrer">visit the ML-REPA community</a> for ongoing content
and opportunities for networking.</p>
<h3 id="new-videos" style="position:relative;">New videos<a href="#new-videos" aria-label="new videos permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Since the summer, we've been building our
<a href="https://www.youtube.com/channel/UC37rp97Go-xIX3aNFVHhXfQ" target="_blank" rel="nofollow noopener noreferrer">YouTube channel</a>.
It's going great: we've gotten more than 18,000 views in the last few months and
1,500 subscribers!</p>
<p>Our latest video in the
<a href="https://www.youtube.com/playlist?list=PL7WG7YrwYcnDBDuCkFbcyjnZQrdskFsBz" target="_blank" rel="nofollow noopener noreferrer">MLOps Tutorials</a>
series introduced using GitHub Actions for model testing. Instead of training a
model in continuous integration, the idea is to train locally and "check in"
your favorite model for testing in a standardized environment. This approach
lets you completely control the environment, infrastructure, and code used to
evaluate your model, and save the run in a place that's easy to share (GitHub!).</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/bSXUJRnQPPo?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>We'll be going deeper into the art and craft of testing ML models in the next
few weeks, so stay tuned. Another big initiative is adding videos to our docs:
since video seems like a popular format for a lot of learners, we're working to
supplement our official docs with embedded videos. Check out our first
installment on the
<a href="https://dvc.org/doc/start/data-and-model-versioning" target="_blank" rel="nofollow noopener noreferrer">Getting Started with Data Versioning</a> page.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/kLKBcPonMYw?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Our community makes some amazing tutorials. Here are a few on our radar:</p>
<p>Data scientist and full-stack developer
<a href="https://github.com/ashutosh1919" target="_blank" rel="nofollow noopener noreferrer">Ashutosh Hathidara</a> shared an end-to-end
machine learning project made with DVC and CML… and released it in video form!
It's a neat setup and a nice model for folks to study.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/H1VBsK7XiKs?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>Another detailed and easy-to-follow tutorial, with a similarly impressive scope,
appeared on <a href="https://www.heise.de/" target="_blank" rel="nofollow noopener noreferrer">Heise Online</a>. This project puts together
DVC, Cortex, and ONNX to develop and deploy a model trained on the Fashion MNIST
dataset (note: the article is in German, and I read it with Chrome's English
translation).</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.heise.de/hintergrund/Verwaltung-und-Inbetriebnahme-von-ML-Modellen-4911723.html" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Managing and commissioning ML models</h4>
<div class="elp-description">Tools like DVC and Cortex, which are designed for the operationalization of AI projects, are intended to help developers deploy models in production.</div>
<div class="elp-link">https://heise.de</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-10-12/heise-2f146c12022ee732eed47276fbd88d8d.png" alt="Managing and commissioning ML models">
</div>
</a>
</section>
<p></p>
<p>You'll also want to check out <a href="https://www.anno.ai/" target="_blank" rel="nofollow noopener noreferrer">anno.ai</a>'s tutorial about
managing large datasets with DVC and S3 storage. It's detailed, but it also works
as a quick-start guide informed by the team's practical experience.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/@anno.ai/mlops-and-data-managing-large-ml-datasets-with-dvc-and-s3-part-1-d5b8f2fb8280" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">MLOps and Data: Managing Large ML Datasets with DVC and S3 (Part 1)</h4>
<div class="elp-description">A quick start guide to version control for machine learning data</div>
<div class="elp-link">medium.com/@anno.ai</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-10-12/legos-b1ceab755de4875476325388196a546a.jpg" alt="MLOps and Data: Managing Large ML Datasets with DVC and S3 (Part 1)">
</div>
</a>
</section>
<p></p>
<p>Data scientist and mathematician <a href="https://twitter.com/KhuyenTran16" target="_blank" rel="nofollow noopener noreferrer">Khuyen Tran</a>
blogged about why and how to start using DVC, and her tutorial includes Google
Drive remote storage, a feature we're especially excited about. Check it out and
follow along with her code examples!</p>
<p>
</p><section class="elp-content-holder">
<a href="https://towardsdatascience.com/introduction-to-dvc-data-version-control-tool-for-machine-learning-projects-7cb49c229fe0" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Introduction to DVC: Data Version Control Tool for Machine Learning Projects</h4>
<div class="elp-description">Just like Git, but with Data!</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-10-12/khuyen_tran-f5684c74ede8821217b19dbcf295d7ec.jpg" alt="Introduction to DVC: Data Version Control Tool for Machine Learning Projects">
</div>
</a>
</section>
<p></p>
<p>And to end on a thoughtful note… have you seen this thread by ML Engineer
<a href="https://twitter.com/sh_reya" target="_blank" rel="nofollow noopener noreferrer">Shreya Shankar</a>? She beautifully summarizes many
of the ideas and technical challenges our community thinks about every day. Read
and reflect!</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">In good software practices, you version code. Use Git. Track changes. Code in master is ground truth.<br><br>In ML, code alone isn't ground truth. I can run the same SQL query today and tomorrow and get different results. How do you replicate this good software practice for ML? (1/7)</p>— Shreya Shankar (@sh_reya) <a href="https://twitter.com/sh_reya/status/1314338372073263112">October 8, 2020</a></blockquote>https://dvc.org/blog/september-20-community-gemshttps://dvc.org/blog/september-20-community-gemsMon, 28 Sep 2020 00:00:00 GMT<h2 id="dvc-questions" style="position:relative;">DVC questions<a href="#dvc-questions" aria-label="dvc questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="q-when-i-try-to-push-to-my-dvc-remote-i-get-an-error-about-my-ssh-rsa-keys-whats-going-on" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/748735263634620518" target="_blank" rel="nofollow noopener noreferrer">Q: When I try to push to my DVC remote, I get an error about my SSH-RSA keys. What's going on?</a><a href="#q-when-i-try-to-push-to-my-dvc-remote-i-get-an-error-about-my-ssh-rsa-keys-whats-going-on" aria-label="q when i try to push to my dvc remote i get an error about my ssh rsa keys whats going on permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If you're using DVC with an SSH-protected remote, DVC uses a Python library
called <code>paramiko</code> to create a connection to your remote. There is a
<a href="https://stackoverflow.com/questions/51955990/base64-decoding-error-incorrect-padding-when-loading-putty-ppk-private-key-to" target="_blank" rel="nofollow noopener noreferrer">known issue</a>
that <code>paramiko</code> expects RSA keys in OpenSSH key format, and can throw an error
if the keys are in an alternative format (such as default PuTTY formatted keys).
If this is the case, you'll likely see:</p>
<div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">ERROR: unexpected error - ('... ssh-rsa ...=', Error('Incorrect padding',))</code></pre></div>
<p>To fix this, convert your RSA key to the OpenSSH format. Tools like
<a href="https://www.puttygen.com/" target="_blank" rel="nofollow noopener noreferrer">PuTTYgen</a> and
<a href="https://mobaxterm.mobatek.net/" target="_blank" rel="nofollow noopener noreferrer">MobaKeyGen</a> can help you do this.</p>
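<p>Before converting, it can help to confirm which format a key file is in. The sketch below is a heuristic rather than a real key parser: it only inspects the first line of the file. The function name <code>detect_private_key_format</code> and the labels it returns are hypothetical, chosen for illustration.</p>

```python
def detect_private_key_format(first_line: str) -> str:
    """Guess a private key's format from the first line of the file.

    This is a heuristic: it looks at the header only and never parses the key.
    """
    line = first_line.strip()
    # PuTTY .ppk files start with a header such as "PuTTY-User-Key-File-2: ssh-rsa".
    if line.startswith("PuTTY-User-Key-File-"):
        return "putty"
    # Newer OpenSSH private keys use this exact header.
    if line == "-----BEGIN OPENSSH PRIVATE KEY-----":
        return "openssh"
    # Traditional PEM headers, e.g. "-----BEGIN RSA PRIVATE KEY-----".
    if line.startswith("-----BEGIN ") and line.endswith(" PRIVATE KEY-----"):
        return "pem"
    return "unknown"


print(detect_private_key_format("PuTTY-User-Key-File-2: ssh-rsa"))
print(detect_private_key_format("-----BEGIN OPENSSH PRIVATE KEY-----"))
```

<p>If the result is <code>putty</code>, the key is a likely candidate for the conversion described above.</p>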
<h3 id="q-can-i-have-multiple-paramyaml-files-in-a-project" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/753322309942509578" target="_blank" rel="nofollow noopener noreferrer">Q: Can I have multiple <code>param.yaml</code> files in a project?</a><a href="#q-can-i-have-multiple-paramyaml-files-in-a-project" aria-label="q can i have multiple paramyaml files in a project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes, you can have as many separate parameter files as you'd like. It's only
important that they are correctly specified in your DVC pipeline stages.</p>
<p>Suppose you have files <code>params_data_processing.yaml</code> and
<code>params_model.yaml</code> in your project (perhaps to store hyperparameters of your
data processing and model fitting stages, respectively). You'll want to reference the
right file at each stage. For example:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-n</span> preprocess <span class="token punctuation">\</span>
<span class="token parameter variable">-p</span> params_data_processing.yaml:param1,param2,<span class="token punctuation">..</span>.</span></code></pre></div>
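<p>If you generate stage commands from a script, the <code>-p file:param1,param2</code> syntax is easy to assemble. The sketch below is only an illustration: the helper <code>params_args</code> is hypothetical, the parameter names are placeholders, and the resulting command is left incomplete in the same way as the snippet above (a real stage also needs the command to run).</p>

```python
import shlex
from typing import Dict, List


def params_args(params_by_file: Dict[str, List[str]]) -> List[str]:
    """Build one `-p file:name1,name2` pair per parameters file."""
    args: List[str] = []
    for path, names in params_by_file.items():
        args += ["-p", f"{path}:{','.join(names)}"]
    return args


# Placeholder parameter names; swap in the ones your stage actually reads.
preprocess_command = ["dvc", "run", "-n", "preprocess"] + params_args(
    {"params_data_processing.yaml": ["param1", "param2"]}
)
print(shlex.join(preprocess_command))
```

<p>A model-fitting stage would pass <code>params_model.yaml</code> to the same helper, so each stage tracks only its own file.</p>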
<h3 id="q-is-there-a-way-to-automatically-produce-svg-plots-from-dvc-plot-i-dont-like-having-to-click-through-the-vega-lite-gui-to-get-an-svg-and-my-plots-look-so-small-when-i-access-them-in-the-browser" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/750012082149392414" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to automatically produce SVG plots from <code>dvc plot</code>? I don't like having to click through the Vega-Lite GUI to get an SVG, and my plots look so small when I access them in the browser.</a><a href="#q-is-there-a-way-to-automatically-produce-svg-plots-from-dvc-plot-i-dont-like-having-to-click-through-the-vega-lite-gui-to-get-an-svg-and-my-plots-look-so-small-when-i-access-them-in-the-browser" aria-label="q is there a way to automatically produce svg plots from dvc plot i dont like having to click through the vega lite gui to get an svg and my plots look so small when i access them in the browser permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If your DVC plots (and by DVC plots, we mean Vega-Lite plots 😉) look small in
your browser, you can modify this programmatically! DVC generates Vega-Lite
plots by way of a few templates that come pre-loaded. The templates are in
<code>.dvc/plots</code> (assuming you're in a DVC directory).</p>
<p>Find the template that corresponds to your plot (if you didn't specify a plot
type in your CLI command, it's probably <code>default.json</code>) and modify the <code>height</code>
and <code>width</code> parameters. Then save your changes.</p>
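<p>The same change can be scripted. This is a minimal sketch, assuming the template is plain JSON with <code>width</code> and <code>height</code> as top-level keys; check your own template before relying on that. The function name <code>resize_template</code> and the example spec are hypothetical.</p>

```python
import json


def resize_template(template_text: str, width: int, height: int) -> str:
    """Return the template JSON with its top-level width and height replaced."""
    spec = json.loads(template_text)
    spec["width"] = width    # assumption: width is a top-level key
    spec["height"] = height  # assumption: height is a top-level key
    return json.dumps(spec, indent=4)


# A tiny stand-in for a real template file.
example_template = '{"title": "demo", "width": 300, "height": 300}'
print(resize_template(example_template, width=800, height=600))
```

<p>To apply it, read the text of your template (for example <code>.dvc/plots/default.json</code>), pass it through the function, and write the result back to the same file.</p>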
<p>For more about how to modify your plot templates, check out the
<a href="https://vega.github.io/vega/docs/specification/" target="_blank" rel="nofollow noopener noreferrer">Vega docs</a>. If you're
considering making a whole new template that's custom for your data viz needs,
<a href="https://dvc.org/doc/command-reference/plots#custom-templates" target="_blank" rel="nofollow noopener noreferrer">we've got docs on that</a>,
too.</p>
<p>One last tip: did you know about the
<a href="https://anaconda.org/conda-forge/vega-lite-cli" target="_blank" rel="nofollow noopener noreferrer">Vega-Lite CLI</a>? It provides
commands for converting Vega-Lite plots to <code>.pdf</code>, <code>.png</code>, <code>.svg</code>, and <code>.vg</code>
(Vega) formats. To use this approach with DVC, you'll want to use the
<code>--show-vega</code> flag to print your plot specification and redirect it to a <code>.json</code> file.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots</span> <span class="token parameter variable">--show-vega</span> <span class="token operator">></span> vega.json
</span><span class="token line"><span class="token input">$ </span><span class="token command">vl2svg</span> vega.json</span></code></pre></div>
<h3 id="q-im-confused-about-external-dependencies-and-outputs-whats-the-difference" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/752478399326453840" target="_blank" rel="nofollow noopener noreferrer">Q: I'm confused about external dependencies and outputs. What's the difference?</a><a href="#q-im-confused-about-external-dependencies-and-outputs-whats-the-difference" aria-label="q im confused about external dependencies and outputs whats the difference permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In short, external outputs and dependencies are files or directories that are
tracked by DVC, but physically reside outside of the local workspace. This could
happen for a few reasons:</p>
<ul>
<li>You want to version a dataset in cloud storage that is too large to transfer
to your local workspace efficiently</li>
<li>Your DVC pipeline writes directly to cloud storage</li>
<li>Your DVC pipeline depends on a dataset or other file in cloud storage</li>
</ul>
<p>An <strong>external output</strong> can be declared in two ways. If you have a file
<code>data.csv</code> in S3 storage, you can use
<a href="https://dvc.org/doc/command-reference/add#--external"><code>dvc add --external s3://mybucket/data.csv</code></a> to begin DVC tracking the file
(<a href="https://dvc.org/doc/user-guide/managing-external-data" target="_blank" rel="nofollow noopener noreferrer">there are plenty more details and tips about managing external data in our docs</a>).
You can also declare <code>data.csv</code> as an output of a DVC pipeline with
<code>dvc run -o s3://mybucket/data.csv</code>.</p>
<p>An <strong>external dependency</strong> is a dependency of a DVC pipeline that resides in
cloud storage. It's declared with the syntax
<code>dvc run -d s3://mybucket/data.csv</code>.</p>
<p>One other difference to note: DVC doesn't cache external dependencies; it merely
checks if they have changed when you run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>. On the other hand, DVC
<em>does</em> cache external outputs. You'll want to set up an
<a href="https://dvc.org/doc/user-guide/how-to/share-a-dvc-cache#configure-the-shared-cache" target="_blank" rel="nofollow noopener noreferrer">external cache</a>
in the same remote location where your files are stored. This is because the
default cache location (in your local workspace) no longer makes sense when the
dataset never "visits" your local workspace! An external cache works largely the
same as a typical cache in your workspace.</p>
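<p>As a rough illustration of "outside the local workspace", the sketch below sorts a stage's paths into local and external ones by URL scheme. It is not how DVC itself decides: the helper <code>is_external</code> and the scheme list are assumptions made for this example.</p>

```python
from urllib.parse import urlsplit

# Assumed set of remote schemes for this illustration only.
EXTERNAL_SCHEMES = {"s3", "gs", "ssh", "hdfs", "http", "https"}


def is_external(path: str) -> bool:
    """Treat a path as external when it carries a known remote URL scheme."""
    return urlsplit(path).scheme.lower() in EXTERNAL_SCHEMES


for candidate in ["s3://mybucket/data.csv", "data/data.csv"]:
    kind = "external" if is_external(candidate) else "local"
    print(f"{candidate}: {kind}")
```

<p>Paths flagged as external are the ones where the caching difference above matters: as dependencies they are only checked for changes, while as outputs they need an external cache.</p>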
<h2 id="cml-questions" style="position:relative;">CML questions<a href="#cml-questions" aria-label="cml questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="q-how-can-i-use-cml-with-my-own-docker-container" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/757553135840526376" target="_blank" rel="nofollow noopener noreferrer">Q: How can I use CML with my own Docker container?</a><a href="#q-how-can-i-use-cml-with-my-own-docker-container" aria-label="q how can i use cml with my own docker container permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In many of our CML docs and videos, we've shown how to get CML on your CI
(continuous integration) runner via a Docker container that comes with
everything installed. But this is not the only way to use CML, especially if you
want workflows to run in your own Docker container.</p>
<p>You can install CML via <code>npm</code>, either in your own Docker container or in your CI
workflow (i.e., in your GitHub Actions <code>.yaml</code> or GitLab CI <code>.yml</code> workflow
file).</p>
<p>To install CML as a package, you'll want to run:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token function">npm</span> i <span class="token parameter variable">-g</span> @dvcorg/cml</code></pre></div>
<p>Note that you may need to install additional dependencies if you want to use DVC
plots and Vega-Lite commands:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token function">sudo</span> <span class="token function">apt-get</span> <span class="token function">install</span> <span class="token parameter variable">-y</span> libcairo2-dev libpango1.0-dev libjpeg-dev libgif-dev <span class="token punctuation">\</span>
librsvg2-dev libfontconfig-dev
$ <span class="token function">npm</span> <span class="token function">install</span> <span class="token parameter variable">-g</span> vega-cli vega-lite</code></pre></div>
<p>If you're installing CML as part of your workflow, you may need to install Node
first;
<a href="https://github.com/iterative/cml#install-cml-as-a-package" target="_blank" rel="nofollow noopener noreferrer">check out our docs</a>
for how to do this in GitHub Actions and GitLab CI.</p>
<h3 id="q-after-running-a-github-action-workflow-that-runs-a-dvc-pipeline-i-want-to-save-the-output-of-the-pipeline-why-doesnt-cml-automatically-save-the-output" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/757686601953312988" target="_blank" rel="nofollow noopener noreferrer">Q: After running a GitHub Action workflow that runs a DVC pipeline, I want to save the output of the pipeline. Why doesn't CML automatically save the output?</a><a href="#q-after-running-a-github-action-workflow-that-runs-a-dvc-pipeline-i-want-to-save-the-output-of-the-pipeline-why-doesnt-cml-automatically-save-the-output" aria-label="q after running a github action workflow that runs a dvc pipeline i want to save the output of the pipeline why doesnt cml automatically save the output permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>By design, artifacts generated in a CI workflow aren't saved anywhere; they
disappear as soon as the runner shuts down. So a DVC pipeline executed in your
CI system might produce outputs, like transformed datasets and model files, that
will be lost at the end of the run. If you want to save them, there are a few
methods.</p>
<p>One approach is with auto-commits: a <code>git commit</code> at the end of your CI workflow
to commit any new artifacts to your Git repository. However, auto-commits have a
lot of downsides: they don't make sense for a lot of users, and generally, it's
better to re-create outputs as needed than to save them forever in your Git repo.</p>
<p>We created the DVC <code>run-cache</code> in part
<a href="https://stackoverflow.com/questions/61245284/is-it-necessary-to-commit-dvc-files-from-our-ci-pipelines" target="_blank" rel="nofollow noopener noreferrer">to solve this issue</a>.
Here's how it works: you'll set up a DVC remote with access credentials passed to
your GitHub Action/GitLab CI via CML (see, for example,
<a href="https://github.com/iterative/cml_dvc_case/blob/master/.github/workflows/cml.yaml" target="_blank" rel="nofollow noopener noreferrer">this workflow</a>).
Then you'll use the following protocol in your CI workflow (your workflow config
file in GitHub/GitLab):</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> <span class="token parameter variable">--run-cache</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span> <span class="token parameter variable">--run-cache</span></span></code></pre></div>
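<p>If your CI job drives these steps from a script rather than directly in the workflow file, the order and the stop-on-failure behaviour can be sketched as follows. This is an illustration under assumptions: the names <code>run_ci_steps</code> and <code>dry_run</code> are hypothetical, and the demo uses a dry run so that it works without DVC installed.</p>

```python
import subprocess
from typing import Callable, List, Sequence

# The three commands from the protocol above, in order.
CI_STEPS: List[List[str]] = [
    ["dvc", "pull", "--run-cache"],
    ["dvc", "repro"],
    ["dvc", "push", "--run-cache"],
]


def run_ci_steps(run: Callable[[Sequence[str]], int] = subprocess.call) -> None:
    """Run each step in order and stop at the first non-zero exit code."""
    for step in CI_STEPS:
        exit_code = run(step)
        if exit_code != 0:
            raise RuntimeError(f"{' '.join(step)} failed with exit code {exit_code}")


def dry_run(step: Sequence[str]) -> int:
    """Stand-in runner that only prints the command it would execute."""
    print("would run:", " ".join(step))
    return 0


run_ci_steps(dry_run)
```

<p>Calling <code>run_ci_steps()</code> with no argument runs the real commands through <code>subprocess.call</code>. Stopping on the first failure means nothing is pushed after a failed <code>dvc repro</code>.</p>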
<p>When you use this design, any artifacts of <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>, such as models or
transformed datasets, will be saved in DVC storage and indexed by the pipeline
version that generated them. You can access them in your local workspace by
running</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> <span class="token parameter variable">--run-cache</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span></span></code></pre></div>
<p>While we think this is ideal for typical data science and machine learning
workflows, there are other approaches too. If you want to go deeper exploring
auto-commits, check out the
<a href="https://github.com/marketplace/actions/add-commit" target="_blank" rel="nofollow noopener noreferrer">Add & Commit GitHub Action</a>.</p>
<h3 id="q-what-can-cml-do-that-circle-ci-cant-do" style="position:relative;"><a href="https://www.youtube.com/watch?v=9BgIDqAzfuA&lc=Ugylt6QR5ClmD8uHe4B4AaABAg" target="_blank" rel="nofollow noopener noreferrer">Q: What can CML do that Circle CI can't do?</a><a href="#q-what-can-cml-do-that-circle-ci-cant-do" aria-label="q what can cml do that circle ci cant do permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>To be clear, CML isn't a competitor to Circle CI. Circle CI is more analogous to
GitHub Actions or GitLab CI; it's a continuous integration system.</p>
<p>CML is a toolkit that works with a continuous integration system to 1) provide
big data management (via DVC & cloud storage), 2) help you write model metrics
and data viz to comments in GitHub/Lab, and 3) orchestrate cloud resources for
model training and testing. Currently, CML is only available for GitHub Actions
and GitLab CI.</p>
<p>So to sum it up: CML is not a standalone continuous integration system! It's a
toolkit that works with existing systems, which in the future could include
Circle CI, Jenkins, Bamboo, Azure DevOps Pipelines, and Travis CI. Feel free to
<a href="https://github.com/iterative/cml/issues" target="_blank" rel="nofollow noopener noreferrer">open a feature request ticket</a>, or
leave a 👍 on open requests, to "vote" for the integrations you'd like to see
most.</p>https://dvc.org/blog/september-20-dvc-heartbeathttps://dvc.org/blog/september-20-dvc-heartbeatWed, 09 Sep 2020 00:00:00 GMT<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="dmitry-on-software-engineering-daily" style="position:relative;">Dmitry on Software Engineering Daily<a href="#dmitry-on-software-engineering-daily" aria-label="dmitry on software engineering daily permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Our CEO Dmitry Petrov was interviewed on the much-beloved Software Engineering
Daily podcast! Host <a href="https://twitter.com/the_prion" target="_blank" rel="nofollow noopener noreferrer">Jeff Meyerson</a> kicked off
the discussion:</p>
<blockquote>
<p>Code is version controlled through Git, the version control system originally
built to manage the Linux codebase. For decades, software has been developed
using git for version control. More recently, data engineering has become an
unavoidable facet of software development. It is reasonable to ask–why are we
not version controlling our data?</p>
</blockquote>
<p>For the rest of the episode, listen here!</p>
<p>
</p><section class="elp-content-holder">
<a href="https://softwareengineeringdaily.com/2020/08/24/data-version-control-with-dmitry-petrov/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Data Version Control with Dmitry Petrov</h4>
<div class="elp-description"></div>
<div class="elp-link">softwareengineeringdaily.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-09-09/sedaily-3eb9a64f46034c9319af2b611a55202e.jpeg" alt="Data Version Control with Dmitry Petrov">
</div>
</a>
</section>
<p></p>
<h3 id="contributors-meetup" style="position:relative;">Contributor's meetup<a href="#contributors-meetup" aria-label="contributors meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Last week, we held a meetup for contributors to DVC! Core maintainer
<a href="https://github.com/efiop" target="_blank" rel="nofollow noopener noreferrer">Ruslan Kuprieiev</a> hosted a get-together for folks who
contribute new features, bug fixes, and more to the community. If you missed it,
you can watch it on YouTube.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/jUYSTERXxWg?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h3 id="new-videos" style="position:relative;">New videos<a href="#new-videos" aria-label="new videos permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We've released several new videos to our growing
<a href="https://www.youtube.com/channel/UC37rp97Go-xIX3aNFVHhXfQ" target="_blank" rel="nofollow noopener noreferrer">YouTube channel</a>, and
in cool news, we passed 1,000 subscribers! The support has been surprising in the
best way possible. We're seeing a lot of repeat commenters and folks from the
DVC meetups! It's been so rewarding to get positive feedback from the community
and we're planning to build our YouTube presence even more.</p>
<p><img src="https://media.giphy.com/media/ZE0JppdERv8t4jVCAt/giphy.gif" alt="Happy GIF"></p>
<p><em>Even Skeletor finds joy in this.</em></p>
<p>We now have 4 tutorials in our MLOps series. In the latest, we cover how to use
your own GPU (on-premise or in the cloud) to run GitHub Actions workflows. Check
it out and give it a try; the code examples are freely available :)</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/rVq-SCNyxVc?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>We also made our first ever "explainer" video to talk through how DVC works in
five minutes.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/UbL7VUpv1Bs?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>As always, video requests are welcome! Reach out and let us know what topics and
tutorials you want to see covered. And we appreciate any likes, shares, and
subscribes on our growing YouTube channel.</p>
<h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="a-three-part-cml-series-featuring-r" style="position:relative;">A three-part CML series (featuring R!)<a href="#a-three-part-cml-series-featuring-r" aria-label="a three part cml series featuring r permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC ambassador <a href="https://twitter.com/mribeirodantas" target="_blank" rel="nofollow noopener noreferrer">Marcel Ribeiro-Dantas</a> has
published two of three tutorial blogs in a series on CML! Marcel's use case is
especially cool because he's using R, plus some causal modeling related to his
work in bioinformatics, with GitHub Actions.</p>
<p>In Part I, Marcel introduces his project and how he uses DVC, CML and GitHub
Actions together (with his custom R library).</p>
<p>
</p><section class="elp-content-holder">
<a href="https://mribeirodantas.xyz/blog/index.php/2020/08/10/continuous-machine-learning/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Continuous Machine Learning - Part I</h4>
<div class="elp-description">by Marcel Ribeiro-Dantas</div>
<div class="elp-link">mribeirodantas.xyz</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-09-09/MLOps-8126305fe5b093898314fd6250f4b95c.png" alt="Continuous Machine Learning - Part I">
</div>
</a>
</section>
<p></p>
<p>In Part II, Marcel takes a deeper dive into Docker. He explains how to create
your own Docker image and test it. This case should be helpful for folks who
want to include the CML library in their own Docker container.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://mribeirodantas.xyz/blog/index.php/2020/08/18/continuous-machine-learning-part-ii/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Continuous Machine Learning - Part II</h4>
<div class="elp-description">by Marcel Ribeiro-Dantas</div>
<div class="elp-link">mribeirodantas.xyz</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-09-09/docker_logo-08a71e88bc63e58e0b64be1a87f46d19.png" alt="Continuous Machine Learning - Part II">
</div>
</a>
</section>
<p></p>
<h3 id="real-python-talks-dvc" style="position:relative;">Real Python talks DVC<a href="#real-python-talks-dvc" aria-label="real python talks dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://twitter.com/kristijan_ivanc" target="_blank" rel="nofollow noopener noreferrer">Kristijan Ivancic</a> of
<a href="https://realpython.com">Real Python</a>, a library of online Python tutorials and lessons,
created a <em>seriously</em> impressive DVC tutorial (this thing is a beast 🐺- it has
a table of contents!)</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 500px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/0223cef84ac40de89a1dee595a176c51/39600/Real_Python.png" alt="Real Python" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>How cool is this artwork?</em></p>
<p>And, the Real Python podcast discussed their DVC tutorial (plus the joys of
version control for data!) on a recent episode.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://realpython.com/podcasts/rpp/25/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Episode 25: Data Version Control in Python and Real Python Video Transcripts</h4>
<div class="elp-description">The Real Python Podcast</div>
<div class="elp-link">realpython.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-09-09/podcast_log-3fcbb7ce3ba571bd42ef1742701fcda4.png" alt="Episode 25: Data Version Control in Python and Real Python Video Transcripts">
</div>
</a>
</section>
<p></p>
<h3 id="recommended-reading" style="position:relative;">Recommended reading<a href="#recommended-reading" aria-label="recommended reading permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There's a lot of cool stuff happening out there in the data science world 🌏!</p>
<ul>
<li><a href="https://twitter.com/fab_clemente" target="_blank" rel="nofollow noopener noreferrer">Fabiana Clemente</a>, Chief Data Officer of
<a href="https://ydata.ai/" target="_blank" rel="nofollow noopener noreferrer">YData</a>, published a blog for The Startup about four
reasons to start using data version control- and, with her expertise in data
privacy, she's especially well-qualified to explain the role of DVC in
compliance and auditing! Check out her blog (it comes with a quick-start
tutorial, too).</li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/swlh/4-reasons-why-data-scientists-should-version-data-672aca5bbd0b" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">4 reasons why data scientists should version data</h4>
<div class="elp-description">How to start data versioning using DVC</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-09-09/fabiana-1a47c481b8ffec6781d7125892845083.jpg" alt="4 reasons why data scientists should version data">
</div>
</a>
</section>
<p></p>
<ul>
<li>Ryzal Kamis at the <a href="https://makerspace.aisingapore.org">AI Singapore Makerspace</a>
shared a blog (the first of two!) about creating end-to-end CI/CD workflows
for machine learning. In his first blog, Ryzal gives a high-level overview of
the need for data version control and compares several tools in the space.
Then he gives a walkthrough (quite easy to follow!) of how DVC fits in his
workflow. We're eagerly awaiting the second installment of this series, which
promises to bring more advanced automation scenarios and a CI/CD pipeline.</li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://makerspace.aisingapore.org/2020/08/data-versioning-for-cd4ml-part-1/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Data Versioning for CD4ML</h4>
<div class="elp-description">Part 1</div>
<div class="elp-link">makerspace.aisingapore.org</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-09-09/singapore-4ff5f8d09325f533e0f48806b348eb95.jpg" alt="Data Versioning for CD4ML">
</div>
</a>
</section>
<p></p>
<ul>
<li><a href="https://www.infoworld.com/author/Isaac-Sacolick/" target="_blank" rel="nofollow noopener noreferrer">Isaac Sacolick</a>,
contributing editor at InfoWorld, penned an article about the growing field of
MLOps and its role in data-driven businesses. He writes:</li>
</ul>
<blockquote>
<p>Too many data and technology implementations start with poor or no problem
statements and with inadequate time, tools, and subject matter expertise to
ensure adequate data quality. Organizations must first start with asking smart
questions about big data, investing in dataops, and then using agile
methodologies in data science to iterate toward solutions.</p>
</blockquote>
<p>Read the rest here:</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.infoworld.com/article/3570716/mlops-the-rise-of-machine-learning-operations.html" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">MLops: The rise of machine learning operations</h4>
<div class="elp-description">Once machine learning models make it to production, they still need updates and monitoring for drift. A team to manage ML operations makes good business sense</div>
<div class="elp-link">infoworld.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-09-09/infoworld-f3a590c3134bfbf003256a42ae130b55.png" alt="MLops: The rise of machine learning operations">
</div>
</a>
</section>
<p></p>
<p>Thanks everyone, that's a wrap for this month. Be safe, stay in touch, and get
ready for pumpkin spice latte season 🎃.</p>
<p><img src="https://media.giphy.com/media/EDpVRPFK5bjfq/giphy.gif" alt="Cat Fall GIF"></p>https://dvc.org/blog/august-20-community-gemshttps://dvc.org/blog/august-20-community-gemsThu, 27 Aug 2020 00:00:00 GMT<p>Here are some of our top Q&A's from around the community. With the launch of
<a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a> earlier in the month, we've got some new ground to cover!</p>
<h2 id="dvc-questions" style="position:relative;">DVC questions<a href="#dvc-questions" aria-label="dvc questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="q-whats-the-relationship-between-the-dvc-remote-and-cache-if-i-have-an-external-cache-do-i-really-need-a-dvc-remote" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/747588572479094866" target="_blank" rel="nofollow noopener noreferrer">Q: What's the relationship between the DVC remote and cache? If I have an external cache, do I really need a DVC remote?</a><a href="#q-whats-the-relationship-between-the-dvc-remote-and-cache-if-i-have-an-external-cache-do-i-really-need-a-dvc-remote" aria-label="q whats the relationship between the dvc remote and cache if i have an external cache do i really need a dvc remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You can think of your DVC remote as similar to your Git remote, but for data and
model artifacts- it's a place to back up and share artifacts. It also gives you
methods to push and pull those artifacts to and from your team.</p>
<p>Your DVC cache (by default, it's located in <code>.dvc/cache</code>) serves a similar
purpose to your Git objects database (which is by default located in
<code>.git/objects</code>). They're both <em>local</em> caches that store files (including various
versions of them) in a content-addressable format, which helps you quickly
check out different versions into your local workspace. The difference is that
<code>.dvc/cache</code> is for data/model artifacts, and <code>.git/objects</code> is for code.</p>
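<p>To make "content-addressable" concrete, here is a minimal Python sketch of the idea. It is not DVC's or Git's actual implementation: it assumes an MD5 digest and a two-character directory prefix, and the function names (<code>cache_path</code>, <code>store</code>, <code>checkout</code>) are made up for illustration.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import hashlib
import shutil
import tempfile
from pathlib import Path


def cache_path(cache_dir, data):
    """Return the cache location for some bytes, derived only from their content."""
    digest = hashlib.md5(data).hexdigest()
    # Assumption for this sketch: the first two hex characters name the
    # folder and the rest name the file, so no single folder grows too large.
    return Path(cache_dir) / digest[:2] / digest[2:]


def store(cache_dir, data):
    """Write bytes into the cache once; identical content maps to the same path."""
    target = cache_path(cache_dir, data)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return target


def checkout(cached_file, workspace_file):
    """Copy a cached version into the workspace (real tools may link instead)."""
    workspace_file = Path(workspace_file)
    workspace_file.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(cached_file, workspace_file)
    return workspace_file


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "cache"
        v1 = store(cache, b"model weights, version 1")
        v2 = store(cache, b"model weights, version 2")
        again = store(cache, b"model weights, version 1")
        print(v1 == again)  # identical content is stored only once
        print(v1 != v2)     # different content gets a different address
</code></pre></div>
<p>Because the address depends only on the content, switching between versions comes down to looking up a hash and placing the matching file in the workspace.</p>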
<p>Usually, your DVC remote is a superset of <code>.dvc/cache</code>- everything in your cache
is a copy of something in your remote. The exceptions are files you have never
tried to <code>push</code> or <code>pull</code>: until you do, a file can exist in your remote but not
in your cache, or in your cache but not in your remote.</p>
<p>In theory, if you are using an
<a href="https://dvc.org/doc/use-cases/fast-data-caching-hub#example-shared-development-server" target="_blank" rel="nofollow noopener noreferrer">external cache</a>-
meaning a DVC cache configured on a separate volume (like NAS, large HDD, etc.)
outside your project path- and all your projects and all your teammates use that
external cache, and you <em>know</em> that the storage is highly reliable, you don't
need to also have a DVC remote. If you have any doubts about access to your
external cache or its reliability, we'd recommend also keeping a remote.</p>
<h3 id="q-one-of-my-files-is-an-output-of-a-dvc-pipeline-and-i-want-to-track-this-file-with-git-and-store-it-in-my-git-repository-since-it-isnt-very-big-how-can-i-make-this-work" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/732308317627613235" target="_blank" rel="nofollow noopener noreferrer">Q: One of my files is an output of a DVC pipeline, and I want to track this file with Git and store it in my Git repository since it isn't very big. How can I make this work?</a><a href="#q-one-of-my-files-is-an-output-of-a-dvc-pipeline-and-i-want-to-track-this-file-with-git-and-store-it-in-my-git-repository-since-it-isnt-very-big-how-can-i-make-this-work" aria-label="q one of my files is an output of a dvc pipeline and i want to track this file with git and store it in my git repository since it isnt very big how can i make this work permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes! There are two approaches. We'll be assuming you have a pipeline stage that
outputs a file, <code>myfile</code>.</p>
<ul>
<li>If you haven't declared the pipeline stage with <code>dvc run</code> yet, then you'll do
it like this:</li>
</ul>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-n</span> <span class="token operator"><</span>stage name<span class="token operator">></span> <span class="token parameter variable">-d</span> <span class="token operator"><</span>dependency<span class="token operator">></span> <span class="token parameter variable">-O</span> myfile</span></code></pre></div>
<p>Note that instead of using the flag <code>-o</code> for specifying the output <code>myfile</code>,
we're using <code>-O</code>- it's shorthand for <code>--outs-no-cache</code>. You can
<a href="https://dvc.org/doc/command-reference/run#options" target="_blank" rel="nofollow noopener noreferrer">read about this flag in our docs</a>.</p>
<ul>
<li>If you've already created your pipeline stage, go into your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> and
manually add the field <code>cache: false</code> to the stage as follows:</li>
</ul>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">outs</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token key atrule">myfile</span><span class="token punctuation">:</span>
      <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span></code></pre></div>
<p>Please note one special case: if you previously enabled hardlinks or symlinks in
DVC via <a href="https://dvc.org/doc/command-reference/config"><code>dvc config cache</code></a>, you may need to run <a href="https://dvc.org/doc/command-reference/unprotect"><code>dvc unprotect myfile</code></a> to fully
unlink <code>myfile</code> from your DVC cache. If you haven't enabled these types of file
links (and if you're not sure, <em>you probably didn't!</em>), this step is unnecessary.
<a href="https://dvc.org/doc/command-reference/unprotect" target="_blank" rel="nofollow noopener noreferrer">See our docs for more.</a></p>
<h3 id="q-can-i-change-my-paramsyaml-file-to-a-json" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/730614265051873370" target="_blank" rel="nofollow noopener noreferrer">Q: Can I change my <code>params.yaml</code> file to a <code>.json</code>?</a><a href="#q-can-i-change-my-paramsyaml-file-to-a-json" aria-label="q can i change my paramsyaml file to a json permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes, this is straightforward- you change your <code>params.yaml</code> to <code>params.json</code> in
your workspace, and then use it in <code>dvc run</code>:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-p</span> params.json:myparam <span class="token punctuation">..</span>.</span></code></pre></div>
<p>Alternatively, if your pipeline stage has already been created, you can manually
edit your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file to replace <code>params.yaml</code> with <code>params.json</code>.</p>
<p>For more about the <code>params.yaml</code> file,
<a href="https://dvc.org/doc/start/experiments#defining-parameters" target="_blank" rel="nofollow noopener noreferrer">see our docs</a>.</p>
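<p>On the script side, reading the JSON file is a few lines of standard-library Python. The sketch below is only an illustration: <code>load_param</code> is a hypothetical helper, and apart from <code>myparam</code> the file contents are invented. It accepts dot-separated names for nested values.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import json
import tempfile
from pathlib import Path


def load_param(params_file, name):
    """Read one (optionally nested, dot-separated) parameter from a JSON params file."""
    with open(params_file, encoding="utf-8") as handle:
        value = json.load(handle)
    for key in name.split("."):
        value = value[key]
    return value


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        params_file = Path(tmp) / "params.json"
        # Hypothetical contents; only "myparam" appears in the command above.
        params_file.write_text(json.dumps({"myparam": 0.1, "train": {"epochs": 5}}))
        print(load_param(params_file, "myparam"))
        print(load_param(params_file, "train.epochs"))
</code></pre></div>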
<h3 id="q-is-there-a-guide-for-migrating-from-git-lfs-to-dvc" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/743559246599421974" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a guide for migrating from Git-LFS to DVC?</a><a href="#q-is-there-a-guide-for-migrating-from-git-lfs-to-dvc" aria-label="q is there a guide for migrating from git lfs to dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We don't know of any published guide. One of our users shared their procedure
for disabling LFS:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> lfs uninstall
</span><span class="token line"><span class="token input">$ </span><span class="token command">git</span> <span class="token function">rm</span> .gitattributes
</span><span class="token line"><span class="token input">$ </span><span class="token command">git</span> <span class="token function">rm</span> .lfsconfig</span></code></pre></div>
<p>Then you can <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> files you wish to put in DVC tracking, and <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>
them to your remote. After that, <code>git commit</code> and you're good!</p>
<p>Note that, if you're going to delete any LFS files, make sure you're certain the
corresponding data has been transferred to DVC.</p>
<h3 id="q-is-there-a-way-to-use-dvc-and-cml-to-validate-a-model-in-a-github-action-without-making-the-validation-data-available-to-the-user-opening-the-pull-request" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/739202123295883325" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to use DVC and CML to validate a model in a GitHub Action, without making the validation data available to the user opening the Pull Request?</a><a href="#q-is-there-a-way-to-use-dvc-and-cml-to-validate-a-model-in-a-github-action-without-making-the-validation-data-available-to-the-user-opening-the-pull-request" aria-label="q is there a way to use dvc and cml to validate a model in a github action without making the validation data available to the user opening the pull request permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We don't have special support for this use case, and there may be some security
downsides to using a confidential validation dataset with someone else's code
(be sure nothing in their code could expose your data!). But, there are ways to
implement this if you're sure about it.</p>
<p>One possible approach is to create a separate "data registry" repository using a
private cloud bucket to store your validation dataset
(<a href="https://dvc.org/doc/use-cases/data-registries#data-registries" target="_blank" rel="nofollow noopener noreferrer">see our docs about the why and how of data registries</a>).
Your CI system can be set up to have access to the data registry via secrets
(called "variables" in GitLab). Then when you run validation via
<a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro validate</code></a>, you could use <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a> to pull the private data from the
registry.</p>
<p>The data is never exposed to the user in an interactive setting, only on the
runner- and there it's ephemeral, meaning it does not exist once the runner
shuts down.</p>
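<p>As a rough sketch of the download step, a small Python helper on the runner could assemble and run the <code>dvc get</code> call. Everything specific here is an assumption: the <code>REGISTRY_URL</code> secret name, the <code>validation</code> path inside the registry, and the output folder are placeholders, and the runner would still need credentials for the private bucket supplied through its own secrets.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import os
import subprocess


def build_dvc_get_command(registry_url, data_path, out_path, rev=None):
    """Assemble a `dvc get` call that downloads one path from a registry repo."""
    command = ["dvc", "get", registry_url, data_path, "-o", out_path]
    if rev is not None:
        # Optionally pin the data to a specific Git revision of the registry.
        command += ["--rev", rev]
    return command


def fetch_validation_data(out_path="data/validation"):
    """Download the private dataset; intended to run on the CI runner only."""
    # REGISTRY_URL is a hypothetical secret defined in the CI settings.
    registry_url = os.environ["REGISTRY_URL"]
    command = build_dvc_get_command(registry_url, "validation", out_path)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the command; nothing is downloaded here.
    print(build_dvc_get_command(
        "https://example.com/org/registry.git", "validation", "data/validation"
    ))
</code></pre></div>
<p>Keeping the command construction separate from the call that runs it makes the helper easy to check without touching the real registry.</p>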
<h2 id="cml-questions" style="position:relative;">CML questions<a href="#cml-questions" aria-label="cml questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="q-sometimes-when-i-make-a-commit-on-a-branch-my-ci-workflow-isnt-triggered-whats-going-on" style="position:relative;"><a href="https://www.youtube.com/watch?v=9BgIDqAzfuA&lc=UgwKIYsCo194AErdeBJ4AaABAg" target="_blank" rel="nofollow noopener noreferrer">Q: Sometimes when I make a commit on a branch, my CI workflow isn't triggered. What's going on?</a><a href="#q-sometimes-when-i-make-a-commit-on-a-branch-my-ci-workflow-isnt-triggered-whats-going-on" aria-label="q sometimes when i make a commit on a branch my ci workflow isnt triggered whats going on permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If your workflow is set to trigger on a push (as in the CML use cases), it isn't
enough to <code>git commit</code> locally- you need to push to your GitHub or GitLab
repository. If you want every commit to trigger your workflow, you'll need to
push each one!</p>
<p>What if you <em>don't</em> want a push to trigger your workflow? In GitLab, you
can use the
<a href="https://docs.gitlab.com/ee/ci/yaml/#skip-pipeline" target="_blank" rel="nofollow noopener noreferrer"><code>[ci skip]</code> flag</a>- make sure
your commit message contains <code>[ci skip]</code> or <code>[skip ci]</code>, and GitLab CI won't run
the pipeline in your <code>gitlab-ci.yml</code> file.</p>
<p>At the time of writing, GitHub Actions doesn't support this flag, so you have to
cancel unwanted workflows manually in the Actions dashboard. For a programmatic fix,
<a href="https://timheuer.com/blog/skipping-ci-github-actions-workflows/" target="_blank" rel="nofollow noopener noreferrer">check out this workaround by Tim Heuer</a>.</p>
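<p>The check behind such a fix is simple. The Python sketch below is not Tim Heuer's workflow, just an illustration of the logic: a first step in a job could read the latest commit message and stop early when it contains a skip marker. Matching case-insensitively is an assumption of this sketch, and the function names are made up.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import subprocess

# The two markers mentioned above for GitLab CI.
SKIP_MARKERS = ("[ci skip]", "[skip ci]")


def should_skip(commit_message):
    """Return True when the commit message asks CI not to run."""
    lowered = commit_message.lower()
    return any(marker in lowered for marker in SKIP_MARKERS)


def last_commit_message():
    """Read the most recent commit message from the checked-out repository."""
    result = subprocess.run(
        ["git", "log", "-1", "--pretty=%B"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(should_skip("Update README [ci skip]"))
    print(should_skip("Retrain model with new data"))
</code></pre></div>
<p>Note that a guard like this runs inside the workflow, so the workflow still starts; it only lets the remaining steps exit early.</p>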
<h3 id="q-can-i-do-the-bulk-of-my-model-training-outside-of-my-ci-system-and-then-share-the-result-with-cml" style="position:relative;"><a href="https://twitter.com/peterkuai/status/1295899690404175872" target="_blank" rel="nofollow noopener noreferrer">Q: Can I do the bulk of my model training outside of my CI system, and then share the result with CML?</a><a href="#q-can-i-do-the-bulk-of-my-model-training-outside-of-my-ci-system-and-then-share-the-result-with-cml" aria-label="q can i do the bulk of my model training outside of my ci system and then share the result with cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Definitely! This is a desirable workflow in several cases:</p>
<ul>
<li>You have a preferred approach for experiment tracking (for example, DVC or
MLFlow) that you want to keep using</li>
<li>You don't want to set up a self-hosted runner to connect your computing
resources to GitHub or GitLab</li>
<li>Training time is on the order of days or more</li>
</ul>
<p>CML is very flexible, and one strong use case is for sanity checking and
evaluating a model in a CI system post-training. When you have a model that
you're satisfied with, you can check it into your CI system and use CML to
evaluate the model in a production-like environment (such as a custom Docker
container) and report its behavior and informative metrics. Then you can decide if
it's ready to be merged into your main branch.</p>
<h3 id="q-can-i-make-a-cml-report-comparing-models-across-different-branches-of-a-project" style="position:relative;"><a href="https://github.com/iterative/cml/issues/188" target="_blank" rel="nofollow noopener noreferrer">Q: Can I make a CML report comparing models across different branches of a project?</a><a href="#q-can-i-make-a-cml-report-comparing-models-across-different-branches-of-a-project" aria-label="q can i make a cml report comparing models across different branches of a project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Definitely. This is what <a href="https://dvc.org/doc/command-reference/metrics/diff"><code>dvc metrics diff</code></a> is for- like a <code>git diff</code>, but for
model metrics instead of code. We made a video about how to do this in CML!</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/xPncjKH6SPk?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
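<p>Conceptually, a metrics diff lines up the metrics from two revisions and reports the change in each. The Python sketch below illustrates that idea for two metric dictionaries that are already loaded. It is not how DVC implements the command, and the metric names and values are hypothetical.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">def metrics_diff(old, new):
    """Compare two flat metric dicts and return rows of (name, old, new, change)."""
    rows = []
    for name in sorted(set(old) | set(new)):
        before = old.get(name)
        after = new.get(name)
        if before is None or after is None:
            change = None  # the metric exists on only one side
        else:
            change = after - before
        rows.append((name, before, after, change))
    return rows


def to_markdown(rows):
    """Render the rows as a Markdown table that can be pasted into a report."""
    def fmt(value):
        return "-" if value is None else format(value, ".4f")

    lines = ["| Metric | Old | New | Change |", "|---|---|---|---|"]
    for name, before, after, change in rows:
        lines.append("| {} | {} | {} | {} |".format(
            name, fmt(before), fmt(after), fmt(change)
        ))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical metrics from the main branch and an experiment branch.
    main_metrics = {"accuracy": 0.80}
    branch_metrics = {"accuracy": 0.85, "f1": 0.70}
    print(to_markdown(metrics_diff(main_metrics, branch_metrics)))
</code></pre></div>
<p>A Markdown table is a convenient output format here because a CML report is itself a Markdown document.</p>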
<h3 id="q-in-the-function-cml-publish-it-looks-like-youre-uploading-published-files-to-httpsassetcmldev-why-dont-you-just-save-images-in-the-git-repository" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/745168931521822740" target="_blank" rel="nofollow noopener noreferrer">Q: In the function <code>cml publish</code>, it looks like you're uploading published files to <code>https://asset.cml.dev</code>. Why don't you just save images in the Git repository?</a><a href="#q-in-the-function-cml-publish-it-looks-like-youre-uploading-published-files-to-httpsassetcmldev-why-dont-you-just-save-images-in-the-git-repository" aria-label="q in the function cml publish it looks like youre uploading published files to httpsassetcmldev why dont you just save images in the git repository permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If an image file is created as part of your workflow, it's ephemeral- it doesn't
exist outside of your CI runner, and will disappear when your runner is shut
down. To include an image in a GitHub or GitLab comment, a link to the image
needs to persist. You could commit the image to your repository, but typically,
<a href="https://stackoverflow.com/questions/61245284/is-it-necessary-to-commit-dvc-files-from-our-ci-pipelines" target="_blank" rel="nofollow noopener noreferrer">it's undesirable to automatically commit results of a CI workflow</a>.</p>
<p>We created a publishing service to help you host files for CML reports. Under
the hood, our service uploads your file to an S3 bucket and uses a key-value
store to share the file with you.</p>
<p>This covers a lot of cases, but if the files you wish to publish can't be shared
with our service for security or privacy reasons, you can emulate the
<code>cml publish</code> function with your own storage. You would push your file to
storage and include a link to its address in your markdown report.</p>https://dvc.org/blog/august-20-dvc-heartbeathttps://dvc.org/blog/august-20-dvc-heartbeatMon, 10 Aug 2020 00:00:00 GMT<p>Welcome to our August roundup of cool news, new releases, and recommended
reading in the MLOps world!</p>
<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="cml-release" style="position:relative;">CML release<a href="#cml-release" aria-label="cml release permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>At the beginning of July, we went live with a new project:
<a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">Continuous Machine Learning, or CML</a> for short. If you
haven't heard, CML is an open-source toolkit for adapting popular continuous
integration systems like GitHub Actions and GitLab CI for machine learning and
data science. This release marks a new stage for our organization: while CML can
work with DVC, and both are built around Git, CML is designed for standalone
use. That means we're supporting TWO projects now!</p>
<p><img src="https://media.giphy.com/media/X5i2BoQeD9kWY/giphy.gif" alt="Threaten Ashley Olsen GIF"></p>
<p>Luckily, we received plenty of encouraging and helpful feedback following the
CML release. CML was on the front page of Hacker News for most of release day!
We also got
<a href="https://www.heise.de/news/Machine-Learning-CML-schickt-Daten-und-Modelltraining-in-die-Pipeline-4841023.html" target="_blank" rel="nofollow noopener noreferrer">covered on Heise</a>,
a popular German IT news source. I (Elle, a proud part of the CML team!) also
gave a talk presenting our approach as part of the MLOps World meeting, which is
now available for online viewing.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/yp0su5mOeko?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>Of course, we're fielding lots of questions too! We've compiled some of the most
common questions (and their answers!) in our last
<a href="https://dvc.org/blog/july-20-community-gems" target="_blank" rel="nofollow noopener noreferrer">Community Gems post</a>, and CML
developer <a href="https://github.com/DavidGOrtega" target="_blank" rel="nofollow noopener noreferrer">David G. Ortega</a> has written a
tutorial for a much-asked-for use case: doing
<a href="https://dvc.org/blog/cml-self-hosted-runners-on-demand-with-gpus" target="_blank" rel="nofollow noopener noreferrer">continuous integration with on-demand GPUs</a>.</p>
<p>If you have comments, questions, or feature requests about CML, we <em>really</em> want
to hear from you. A few ways to be in touch:</p>
<ul>
<li>Open an <a href="https://github.com/iterative/cml/issues" target="_blank" rel="nofollow noopener noreferrer">issue on the project repo</a></li>
<li>Drop by the <a href="https://discord.gg/bzA6uY7" target="_blank" rel="nofollow noopener noreferrer">CML Discord channel</a></li>
<li>Send us <a href="mailto:[email protected]" target="_blank" rel="nofollow noopener noreferrer">an email</a></li>
</ul>
<h3 id="july-meetup" style="position:relative;">July Meetup<a href="#july-meetup" aria-label="july meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Last week, we had another meetup!
<a href="http://mribeirodantas.me/" target="_blank" rel="nofollow noopener noreferrer">DVC Ambassador Marcel</a> kicked us off with a short
talk about how he's using DVC as part of his causal modeling approach to
bioinformatics. It's cool stuff. Then, I talked a bit about CML and did some
live-coding. The beauty of live-coding is getting to answer questions in
real time, and if you're totally new to the idea of continuous integration (or
want to understand how CML works with GitHub Actions/GitLab CI), seeing a project
in action is one of the best ways to learn.</p>
<p>You can watch a recording of the meetup online now (it's lightly edited to
remove some pesky Zoom trolls), and
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups" target="_blank" rel="nofollow noopener noreferrer">join our Meetup group</a> to
get updates for the next one. In future meetups, we'd love to support community
members sharing their work, so get in touch if you'd like to present.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/tnTPHG5seDs?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h3 id="new-video-series" style="position:relative;">New video series<a href="#new-video-series" aria-label="new video series permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We're starting up some new YouTube features! If you haven't seen our channel,
<a href="https://www.youtube.com/channel/UC37rp97Go-xIX3aNFVHhXfQ" target="_blank" rel="nofollow noopener noreferrer">check it out and consider subscribing</a>
for hands-on tutorials and demos. Our
<a href="https://youtu.be/9BgIDqAzfuA" target="_blank" rel="nofollow noopener noreferrer">first video introduced continuous integration and GitHub Actions</a>,
and the second showed
<a href="https://youtu.be/kZKAuShWF0s" target="_blank" rel="nofollow noopener noreferrer">how to use DVC and free Google Drive storage to add external data storage to a GitHub project</a>.</p>
<p>In the coming weeks, we'll be covering:</p>
<ul>
<li>Using CML and GitHub Actions with hardware for deep learning, like on-premise
GPUs</li>
<li>Understanding Vega plots and making data viz part of your CI system</li>
<li>Some DVC basics to supplement our docs</li>
</ul>
<h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="spacy--dvc--️" style="position:relative;">SpaCy + DVC = ❤️<a href="#spacy--dvc--%EF%B8%8F" aria-label="spacy dvc ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We're huge fans of a recent Python Bytes episode featuring
<a href="https://twitter.com/_inesmontani" target="_blank" rel="nofollow noopener noreferrer">Ines Montani</a>, founder of Explosion and one
of the makers of the incredible spaCy library for NLP (seriously, I have the
highest recommendations for spaCy).</p>
<blockquote>
<p>My <a href="https://twitter.com/pythonbytes" target="_blank" rel="nofollow noopener noreferrer">@PythonBytes</a> episode is out now!</p>
<p>🎙️ Listen here: <a href="https://t.co/fHLF2hR4cM" target="_blank" rel="nofollow noopener noreferrer">https://t.co/fHLF2hR4cM</a></p>
<p>My picks of the week are:<br>
🐙 TextAttack by @jxmorris12:
<a href="https://t.co/jySYrtzzp8" target="_blank" rel="nofollow noopener noreferrer">https://t.co/jySYrtzzp8</a><br>
🦉 Data Version Control (DVC) <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">@DVCorg</a>:
<a href="https://t.co/3610F6kv8v" target="_blank" rel="nofollow noopener noreferrer">https://t.co/3610F6kv8v</a><br>
🐍 Built-in generic types in 3.9</p>
<p>— Ines Montani 〰️ (@_inesmontani)
<a href="https://twitter.com/_inesmontani/status/1286222512762871808" target="_blank" rel="nofollow noopener noreferrer">July 23, 2020</a></p>
</blockquote>
<p>Ines' episode discussed DVC, and DVC is going to be integrated with spaCy in
their 3.0 release. Together, spaCy + DVC is going to be a powerhouse, and we can't wait.</p>
<h3 id="take-a-stab-at-shtab" style="position:relative;">Take a stab at shtab<a href="#take-a-stab-at-shtab" aria-label="take a stab at shtab permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Another cool software project: <a href="https://cdcl.ml" target="_blank" rel="nofollow noopener noreferrer">Casper da Costa-Luis</a>, DVC
contributor and creator of the popular
<a href="https://github.com/tqdm/tqdm" target="_blank" rel="nofollow noopener noreferrer">tqdm library</a>, has published a tab-completion
script generator for Python applications! <code>shtab</code>, as it's called, was
originally designed for DVC, but Casper developed it into a generic tool that
can be used for virtually any Python CLI application. Check out
<a href="https://github.com/iterative/shtab" target="_blank" rel="nofollow noopener noreferrer"><code>shtab</code> on GitHub</a> and read the release
blog.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://dvc.org/blog/shtab-completion-release" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">(Tab) Complete Any Python Application in 1 Minute or Less</h4>
<div class="elp-description">We've made a painless tab-completion script generator for Python applications!</div>
<div class="elp-link">dvc.org</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-08-10/shtab-63dfef1b63f0d3983a998c2f2a37e6fe.png" alt="(Tab) Complete Any Python Application in 1 Minute or Less">
</div>
</a>
</section>
<p></p>
<h3 id="dvc-10-migration-script" style="position:relative;">DVC 1.0 migration script<a href="#dvc-10-migration-script" aria-label="dvc 10 migration script permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Our friends at <a href="https://dagshub.com/" target="_blank" rel="nofollow noopener noreferrer">DAGsHub</a> have released a script to help
DVC users upgrade their pipelines to the new DVC 1.0 format! Says Simon, a
DAGsHub engineer, in his tutorial:</p>
<blockquote>
<p>In this post, I'll walk you through the process of migrating your existing
project from DVC ≤ 0.94 to DVC 1.X using a single automated script, and then
demonstrate a way to check that your migration was successful.</p>
</blockquote>
<p>Read the blog and get migrating (but don't worry if you can't; DVC 1.0 is
backwards compatible).
</p><section class="elp-content-holder">
<a href="https://towardsdatascience.com/automatically-migrate-your-project-from-dvc-0-94-to-dvc-1-x-416a5b9e837b" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Automatically migrate your project from DVC≤ 0.94 to DVC 1.x</h4>
<div class="elp-description">Migrating your project from DVC ≤ 0.94 to DVC 1.x can be a very involved process. Here’s an easy way to do it.</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-08-10/dagshub-d94acab82a6d235462cf66823321303b.jpg" alt="Automatically migrate your project from DVC≤ 0.94 to DVC 1.x">
</div>
</a>
</section>
<p></p>
<h3 id="recommended-reading" style="position:relative;">Recommended reading<a href="#recommended-reading" aria-label="recommended reading permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Here are some of our favorite blogs from around the internet 🌏.</p>
<ul>
<li><a href="https://deborahmesquita.com/" target="_blank" rel="nofollow noopener noreferrer">Déborah Mesquita</a>, data scientist (and an
excellent writer to follow), published a tutorial about DVC pipelines that is
truly deserving of the moniker "ultimate guide". It's a start-to-finish case
study about a typical machine learning project, with DVC pipelines to automate
everything from grabbing the data to training and evaluating a model. Also, it
comes with a video tutorial if you prefer to watch instead of read!</li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://towardsdatascience.com/the-ultimate-guide-to-building-maintainable-machine-learning-pipelines-using-dvc-a976907b2a1b" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">The ultimate guide to building maintainable Machine Learning pipelines using DVC</h4>
<div class="elp-description">Learn the principles for building maintainable Machine Learning pipelines using DVC</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-08-10/deborah-fb09cdac9dbd7a3985fab4a3f06e83fb.jpg" alt="The ultimate guide to building maintainable Machine Learning pipelines using DVC">
</div>
</a>
</section>
<p></p>
<ul>
<li>Software engineer
<a href="https://www.linkedin.com/in/vaithyanathan/" target="_blank" rel="nofollow noopener noreferrer">Vaithy Narayanan</a> created the
first ever ☝️ CML user blog! Vaithy created a pipeline that covers data
collection to model training and testing, and used CML to automate the
pipeline execution whenever the project's GitHub repository is updated. He
ends with some insightful discussion about the strengths and weaknesses of the
approach.</li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/@karthik.vaithyanathan/using-continuous-machine-learning-to-run-your-ml-pipeline-eeeeacad69a3" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Using Continuous Machine Learning to Run Your ML Pipeline</h4>
<div class="elp-description">Vaithy Narayanan</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-08-10/vaithy-12db755ef9cb1d18c60fe6d502f8f454.jpg" alt="Using Continuous Machine Learning to Run Your ML Pipeline">
</div>
</a>
</section>
<p></p>
<ul>
<li>
<p><a href="https://www.linkedin.com/in/ryan-w-gross/" target="_blank" rel="nofollow noopener noreferrer">Ryan Gross</a>, a VP at Pariveda
Solutions, blogged about the future of data governance and the lessons from
DevOps that might save the day. Honestly, you should probably start reading
for this cover image alone.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d1678417f2c4f696d8be116ddab483b4/39600/dataops.png" alt="dataops" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DataOps is accurately depicted
as a badass flaming eagle.</em> Check out the blog here:</p>
</li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">The Rise of DataOps (from the ashes of Data Governance)</h4>
<div class="elp-description">Legacy Data Governance is broken in the ML era. Let’s rebuild it as an engineering discipline to drive orders-of-magnitude improvements.</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-08-10/ryan-e96231d9b6f89548cf406226f82782a8.png" alt="The Rise of DataOps (from the ashes of Data Governance)">
</div>
</a>
</section>
<p></p>
<p>And, there's a
<a href="https://locallyoptimistic.com/post/git-for-data-not-a-silver-bullet/?utm_campaign=Data_Elixir&utm_source=Data_Elixir_298" target="_blank" rel="nofollow noopener noreferrer">noteworthy counterpoint</a>
by
<a href="https://www.linkedin.com/in/michael-the-data-guy-kaminsky/" target="_blank" rel="nofollow noopener noreferrer">Michael Kaminsky</a>.
Read them both!</p>
<p>Thanks everyone, that's it for this month. We hope you're staying safe and
making cool things!</p>
<p><img src="https://media.giphy.com/media/35EsMpEfGHkVoHbNTU/giphy.gif" alt="Reaction GIF by MOODMAN"></p>https://dvc.org/blog/cml-self-hosted-runners-on-demand-with-gpushttps://dvc.org/blog/cml-self-hosted-runners-on-demand-with-gpusFri, 07 Aug 2020 00:00:00 GMT<p>When creating your CI/CD workflow for a machine learning (ML) project, you might
find that by default, neither GitHub Actions nor GitLab CI provides the
computing capabilities you need, such as GPUs, high-memory instances, or
multiple cores.</p>
<p>To overcome this hardware hurdle, one practical approach is to use self-hosted
runners: runners that you manage but that remain accessible to your CI/CD system
for executing jobs. A runner could be an EC2 instance or the GPU under your desk. In our
<a href="https://dvc.org/blog/cml-release" target="_blank" rel="nofollow noopener noreferrer">recently-released project</a>, Continuous
Machine Learning (CML), our Docker image acts as a thin wrapper over GitLab and
GitHub runners, adding some extra capabilities.</p>
<p>Here are some benefits of using CML with a self-hosted runner:</p>
<ol>
<li>
<p><strong>Easy to use.</strong> It works the same way for both GitLab and GitHub.</p>
</li>
<li>
<p><strong>Get out of dependency hell.</strong> We tend to install packages (on top of
packages, on top of packages…) while we’re experimenting with models. In ML
in particular, we can be dependent on drivers AND libraries, and sometimes
precise versions of them (CUDA and TensorFlow, anyone?). Your CI workflow
will install all the dependencies in the containerised runner, leaving your
machine clean.</p>
</li>
<li>
<p><strong>Security.</strong> If your repo is public, your runners could be used by
anyone who can add
<a href="https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners#self-hosted-runner-security-with-public-repositories" target="_blank" rel="nofollow noopener noreferrer">scripts that exploit your machine</a>.
With the containerised runner, you restrict access to your real
machine.</p>
</li>
<li>
<p><strong>Gain reproducibility.</strong> One of the biggest technical debts in the ML space
is reproducibility. A few weeks post-experiment, we often discover that
trying to put a model back in shape is a pain. Looking at our repo, it’s
not obvious what data or training infrastructure or dependencies went into a
given result. When you move your ML experiments into a CI/CD system, you are
making a contract of the dependencies and hardware used for your experiment.
With that contract isolated by the containerised runner, your experiment
becomes much easier for anyone to reproduce in the future.</p>
</li>
</ol>
<h2 id="hands-on-gpu-self-hosted-runners-101" style="position:relative;">Hands on GPU Self-hosted runners 101<a href="#hands-on-gpu-self-hosted-runners-101" aria-label="hands on gpu self hosted runners 101 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="1-install-nvidia-drivers-and-nvidia-docker-in-your-machine-ubuntu-1804" style="position:relative;">1) Install nvidia drivers and nvidia-docker in your machine (ubuntu 18.04)<a href="#1-install-nvidia-drivers-and-nvidia-docker-in-your-machine-ubuntu-1804" aria-label="1 install nvidia drivers and nvidia docker in your machine ubuntu 1804 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">curl</span> <span class="token parameter variable">-s</span> <span class="token parameter variable">-L</span> https://nvidia.github.io/nvidia-docker/gpgkey <span class="token operator">|</span> <span class="token function">sudo</span> apt-key <span class="token function">add</span> - <span class="token operator">&&</span> <span class="token punctuation">\</span>
<span class="token function">curl</span> <span class="token parameter variable">-s</span> <span class="token parameter variable">-L</span> https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list <span class="token operator">|</span> <span class="token function">sudo</span> <span class="token function">tee</span> /etc/apt/sources.list.d/nvidia-docker.list <span class="token operator">&&</span> <span class="token punctuation">\</span>
<span class="token function">sudo</span> <span class="token function">apt</span> update <span class="token operator">&&</span> <span class="token function">sudo</span> <span class="token function">apt</span> <span class="token function">install</span> <span class="token parameter variable">-y</span> ubuntu-drivers-common <span class="token operator">&&</span> <span class="token punctuation">\</span>
<span class="token function">sudo</span> ubuntu-drivers autoinstall <span class="token operator">&&</span> <span class="token punctuation">\</span>
<span class="token function">sudo</span> <span class="token function">apt</span> <span class="token function">install</span> <span class="token parameter variable">-y</span> nvidia-container-toolkit <span class="token operator">&&</span> <span class="token punctuation">\</span>
<span class="token function">sudo</span> systemctl restart <span class="token function">docker</span></span></code></pre></div>
<p>You can test that your GPUs are up and running with the following command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">docker</span> run <span class="token parameter variable">--gpus</span> all iterativeai/cml:0-dvc2-base1-gpu nvidia-smi</span></code></pre></div>
<p>We should see something like this:
<span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 594px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9ba66893a70af142402136bb0861e501/39600/nvidia-smi-output.png" alt="nvidia smi output" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h3 id="2-start-your-self-hosted-runner" style="position:relative;">2) Start your self-hosted runner<a href="#2-start-your-self-hosted-runner" aria-label="2 start your self hosted runner permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>With CML Docker images, launching your own self-hosted runner is very easy. These
images have CML and DVC preinstalled (among other perks), plus CUDA drivers.
That's all. You can build on these images and add your own dependencies to better
mimic your own production environment.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">docker</span> run <span class="token parameter variable">--name</span> myrunner <span class="token parameter variable">-d</span> <span class="token parameter variable">--gpus</span> all <span class="token punctuation">\</span>
<span class="token parameter variable">-e</span> <span class="token assign-left variable">RUNNER_IDLE_TIMEOUT</span><span class="token operator">=</span><span class="token number">1800</span> <span class="token punctuation">\</span>
<span class="token parameter variable">-e</span> <span class="token assign-left variable">RUNNER_LABELS</span><span class="token operator">=</span>cml,gpu <span class="token punctuation">\</span>
<span class="token parameter variable">-e</span> <span class="token assign-left variable">RUNNER_REPO</span><span class="token operator">=</span><span class="token variable">$my_repo_url</span> <span class="token punctuation">\</span>
<span class="token parameter variable">-e</span> <span class="token assign-left variable">repo_token</span><span class="token operator">=</span><span class="token variable">$my_repo_token</span> <span class="token punctuation">\</span>
iterativeai/cml:0-dvc2-base1-gpu runner</span></code></pre></div>
<p>where:</p>
<p><code>RUNNER_IDLE_TIMEOUT</code> is the maximum time, in seconds, that the runner
stays idle waiting for jobs. If no job arrives in that time, the runner shuts down and
unregisters from your repo.</p>
<p><code>RUNNER_LABELS</code> is a comma-delimited list of labels assigned to the runner.
Jobs in your workflow that request these labels will wait for a runner that carries them.</p>
<p><code>RUNNER_REPO</code> is the URL of your GitLab or GitHub repo.</p>
<p><code>repo_token</code> is the
personal token generated for your GitHub or GitLab repo. Note that for GitHub
you must check the <code>workflow</code> scope along with <code>repo</code>.</p>
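<p>As a rough sketch of how these settings fit together, the helper below validates its inputs and prints the <code>docker run</code> command from above instead of running it. The function name <code>build_runner_cmd</code> and its default values are illustrative assumptions, not part of CML.</p>

```shell
#!/usr/bin/env bash
# Sketch only: build_runner_cmd is a hypothetical helper, not a CML command.
# It checks the settings described above and prints the docker command.

build_runner_cmd() {
  local repo_url="$1" repo_token="$2"
  local labels="${3:-cml,gpu}" idle="${4:-1800}"

  # The repo URL and the personal token are both required.
  if [ -z "$repo_url" ] || [ -z "$repo_token" ]; then
    echo "error: repo URL and token are required" >&2
    return 1
  fi

  # RUNNER_IDLE_TIMEOUT is a whole number of seconds.
  case "$idle" in
    ''|*[!0-9]*)
      echo "error: idle timeout must be an integer number of seconds" >&2
      return 1
      ;;
  esac

  # Print the command word by word so it can be reviewed before running.
  printf '%s ' docker run --name myrunner -d --gpus all \
    -e "RUNNER_IDLE_TIMEOUT=$idle" \
    -e "RUNNER_LABELS=$labels" \
    -e "RUNNER_REPO=$repo_url" \
    -e "repo_token=$repo_token" \
    iterativeai/cml:0-dvc2-base1-gpu runner
  echo
}
```

<p>Calling <code>build_runner_cmd "$my_repo_url" "$my_repo_token"</code> prints the command with the default labels and a 30-minute idle timeout. The output includes the token, so it is best kept out of shared logs.</p>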
<p>If everything went fine, we should see a runner registered in our repo.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/0f4c6f8d9921fd73fe754a73ae76b04e/39600/registered-cml-runner-github.png" alt="registered cml runner github" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 459px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bff63d0fe6b853a4c80d71ec496f5b4a/39600/registered-cml-runner-gitlab.png" alt="registered cml runner gitlab" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h3 id="3-setup-your-github-actions-or-gitlab-workflow-yaml-file-to-use-the-runner-and-commit-your-changes" style="position:relative;">3) Setup your GitHub Actions or GitLab workflow yaml file to use the runner and commit your changes.<a href="#3-setup-your-github-actions-or-gitlab-workflow-yaml-file-to-use-the-runner-and-commit-your-changes" aria-label="3 setup your github actions or gitlab workflow yaml file to use the runner and commit your changes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>GitLab</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train</span><span class="token punctuation">:</span>
<span class="token key atrule">tags</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> cml
<span class="token punctuation">-</span> gpu
<span class="token key atrule">script</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> echo 'Hi from CML<span class="token tag">!'</span> <span class="token punctuation">></span><span class="token punctuation">></span> report.md
<span class="token punctuation">-</span> cml send<span class="token punctuation">-</span>comment report.md</code></pre></div>
<p>GitHub</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">name</span><span class="token punctuation">:</span> train<span class="token punctuation">-</span>my<span class="token punctuation">-</span>model
<span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>push<span class="token punctuation">]</span>
<span class="token key atrule">jobs</span><span class="token punctuation">:</span>
<span class="token key atrule">train</span><span class="token punctuation">:</span>
<span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">,</span> gpu<span class="token punctuation">]</span>
<span class="token key atrule">steps</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
<span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> cml_run
<span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
echo 'Hi from CML!' >> report.md
cml send-comment report.md</span></code></pre></div>
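<p>The report step in both workflows can grow beyond a greeting. Here is a minimal sketch, assuming <code>nvidia-smi</code> may or may not be on the runner's path. The helper name <code>write_report</code> is hypothetical.</p>

```shell
#!/usr/bin/env bash
# Sketch only: write_report is a hypothetical helper for the job's script step.

write_report() {
  local report="${1:-report.md}"

  # Same greeting as in the workflows above.
  echo 'Hi from CML!' >> "$report"

  # List the GPU names only when nvidia-smi is available on the runner.
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name --format=csv,noheader \
      | sed 's/^/- GPU: /' >> "$report"
  fi
}
```

<p>In a job, you could call <code>write_report report.md</code> and then run <code>cml send-comment report.md</code>, exactly as the workflows above do.</p>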
<p>Congrats! At this point you have done all the steps to have your GPUs up and
running with CML.</p>
<h1 id="limitations-and-future-directions" style="position:relative;">Limitations and future directions<a href="#limitations-and-future-directions" aria-label="limitations and future directions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>There are still some limitations to be solved at this stage:</p>
<ul>
<li>
<p>GitHub Actions
<a href="https://docs.github.com/en/actions/getting-started-with-github-actions/about-github-actions#usage-limits" target="_blank" rel="nofollow noopener noreferrer">can’t run a workflow longer than 72 hours</a>.</p>
</li>
<li>
<p>Self-hosted runners
<a href="https://gitlab.com/gitlab-org/gitlab/-/issues/229851#note_390371734" target="_blank" rel="nofollow noopener noreferrer">don’t behave well when they disconnect from the repo</a>,
limiting the possibilities with preemptible instances (also known as spot
instances).</p>
</li>
</ul>
<p>We’re working on these issues (see issues
<a href="https://github.com/iterative/cml/issues/161" target="_blank" rel="nofollow noopener noreferrer">#161</a>,
<a href="https://github.com/iterative/cml/issues/174" target="_blank" rel="nofollow noopener noreferrer">#174</a>, and
<a href="https://github.com/iterative/cml/issues/208" target="_blank" rel="nofollow noopener noreferrer">#208</a>) both in terms of CML and
DVC capabilities. So keep watching this space for updates!</p>
<hr>
<p>We started CML to help teams deal with the complexity of ML more effectively:
continuous integration is a proven approach to keeping projects agile even as
the team size, number of experiments, and number of dependencies increase.
Treating experiments like potential new features in a software project opens up
many possibilities for improving our engineering practices. We’re looking
forward to an era when ML experiments can be created, logged, and merged into
production-ready code in minutes, not days or weeks.</p>https://dvc.org/blog/july-20-community-gemshttps://dvc.org/blog/july-20-community-gemsFri, 31 Jul 2020 00:00:00 GMT<p>Here are some of our top Q&A's from around the community. With the launch of
<a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a> earlier in the month, we've got some new ground to cover!</p>
<h2 id="dvc-questions" style="position:relative;">DVC questions<a href="#dvc-questions" aria-label="dvc questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="q-recently-i-set-up-a-global-dvc-remote-where-can-i-find-the-config-file" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/717673618217238598" target="_blank" rel="nofollow noopener noreferrer">Q: Recently, I set up a global DVC remote. Where can I find the config file?</a><a href="#q-recently-i-set-up-a-global-dvc-remote-where-can-i-find-the-config-file" aria-label="q recently i set up a global dvc remote where can i find the config file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>When you
<a href="https://dvc.org/doc/command-reference/remote/list#options" target="_blank" rel="nofollow noopener noreferrer">create a global DVC remote</a>,
a config file will be created in <code>~/.config/dvc/config</code> instead of your project
directory (i.e., <code>.dvc/config</code>).</p>
<p>Note that on a Windows system, the config file will be created at
<code>C:\Users\<username>\AppData\Local\iterative\dvc\config</code>.</p>
<h3 id="q-im-working-on-a-collaborative-project-and-i-use-dvc-pull-to-sync-my-local-workspace-with-the-project-repository-then-i-try-running-dvc-repro-but-get-an-error-dvcyaml-does-not-exist-no-one-else-on-my-team-is-having-this-issue-any-ideas" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/731188065078345799" target="_blank" rel="nofollow noopener noreferrer">Q: I'm working on a collaborative project, and I use <code>dvc pull</code> to sync my local workspace with the project repository. Then, I try running <code>dvc repro</code>, but get an error: <code>dvc.yaml does not exist</code>. No one else on my team is having this issue. Any ideas?</a><a href="#q-im-working-on-a-collaborative-project-and-i-use-dvc-pull-to-sync-my-local-workspace-with-the-project-repository-then-i-try-running-dvc-repro-but-get-an-error-dvcyaml-does-not-exist-no-one-else-on-my-team-is-having-this-issue-any-ideas" aria-label="q im working on a collaborative project and i use dvc pull to sync my local workspace with the project repository then i try running dvc repro but get an error dvcyaml does not exist no one else on my team is having this issue any ideas permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This error suggests there is no <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file in your project. Most likely,
this means your teammates are using DVC version 0.94 or earlier, before the
<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> standard was introduced. Meanwhile, it sounds like you're using
version 1.0 or later. You can check by running</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc version</span></span></code></pre></div>
<p>The best solution is for your whole team to upgrade to the latest version, and
there's an easy
<a href="https://towardsdatascience.com/automatically-migrate-your-project-from-dvc-0-94-to-dvc-1-x-416a5b9e837b" target="_blank" rel="nofollow noopener noreferrer">migration script to help you make the move</a>.
If for some reason this won't work for your team, you can either downgrade to a
previous version, or use a workaround:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> <span class="token operator"><</span>.dvc file<span class="token operator">></span></span></code></pre></div>
<p>substituting the appropriate <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file for your pipeline. DVC 1.0 is backwards
compatible, so pipelines created with previous versions will still run.</p>
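<p>If you're unsure which situation you're in, a quick look at the project directory can help. The snippet below is only a rough sketch: the helper name <code>describe_pipeline_files</code> and the sample <code>train.dvc</code> stage file are made up for illustration, and it only inspects the top level of the project.</p>

```python
import tempfile
from pathlib import Path


def describe_pipeline_files(project_dir: str) -> dict:
    """Report whether dvc.yaml exists and which top-level .dvc files are present.

    Hypothetical helper for illustration; it is not part of DVC.
    """
    root = Path(project_dir)
    return {
        "has_dvc_yaml": (root / "dvc.yaml").is_file(),
        "dvc_files": sorted(p.name for p in root.glob("*.dvc")),
    }


# Demo on a throwaway directory that mimics a pre-1.0 project:
# a stage file exists, but there is no dvc.yaml.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.dvc").write_text("cmd: python train.py\n")
    report = describe_pipeline_files(tmp)

print(report)
```

<p>If <code>has_dvc_yaml</code> is false but <code>.dvc</code> files are listed, the project was probably created with an older DVC version, and passing one of those files to <code>dvc repro</code> is the workaround described above.</p>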
<h3 id="q-does-the-dvc-installer-for-windows-also-include-the-dependencies-for-using-cloud-storage-like-s3-and-gcp" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/715717911574216735" target="_blank" rel="nofollow noopener noreferrer">Q: Does the DVC installer for Windows also include the dependencies for using cloud storage, like S3 and GCP?</a><a href="#q-does-the-dvc-installer-for-windows-also-include-the-dependencies-for-using-cloud-storage-like-s3-and-gcp" aria-label="q does the dvc installer for windows also include the dependencies for using cloud storage like s3 and gcp permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If you're installing DVC from a binary, such as the <code>dvc.exe</code>
<a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">downloadable on the DVC homepage</a>, all the standard
dependencies are included. You shouldn't need to use <code>pip</code> to install extra
packages (like <code>boto3</code> for S3 storage).</p>
<h3 id="q-is-there-a-way-to-setup-my-dvc-remote-so-i-can-manually-download-files-from-it-without-going-through-dvc" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/717458695709130764" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to setup my DVC remote so I can manually download files from it without going through DVC?</a><a href="#q-is-there-a-way-to-setup-my-dvc-remote-so-i-can-manually-download-files-from-it-without-going-through-dvc" aria-label="q is there a way to setup my dvc remote so i can manually download files from it without going through dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>When DVC adds a file to a remote repository (such as an S3 bucket, or an SSH
file server), there's only one change happening: DVC calculates an md5 for the
file and renames it with that md5. In technical terms, it's storing files in a
"content-addressable way". That means if you know the hash of a file, you can
locate it in your DVC remote and manually download it.</p>
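<p>As a minimal sketch of that idea, the snippet below hashes some content and builds the path where you would expect to find it. It assumes the remote splits the hash into a two-character directory followed by the remaining characters as the file name; the exact layout can differ between DVC versions, and the remote root <code>my-bucket/dvc-store</code> and both helper names are hypothetical.</p>

```python
import hashlib
from pathlib import PurePosixPath


def md5_of_bytes(data: bytes) -> str:
    """Return the hex MD5 digest of some file content."""
    return hashlib.md5(data).hexdigest()


def remote_object_path(remote_root: str, md5: str) -> str:
    """Build the expected object path inside the remote.

    Assumption: the first two hex characters form a directory name
    and the remaining characters form the file name.
    """
    return str(PurePosixPath(remote_root) / md5[:2] / md5[2:])


digest = md5_of_bytes(b"hello")
path = remote_object_path("my-bucket/dvc-store", digest)
print(digest)
print(path)
```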
<p>To find the hash for a given file, say <code>data.csv</code>, you can look in the
corresponding DVC file:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> data.csv.dvc</span></code></pre></div>
<p>Another approach is using a built-in DVC function:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> <span class="token parameter variable">--show-url</span> <span class="token builtin class-name">.</span> data.csv</span></code></pre></div>
<p>You can read more about <a href="https://dvc.org/doc/command-reference/get#--show-url"><code>dvc get --show-url</code></a> in
<a href="https://dvc.org/doc/command-reference/get#options" target="_blank" rel="nofollow noopener noreferrer">our docs</a>. Note that this
functionality is also part of our Python API, so you can locate the path to a
file in your remote within a Python environment.
<a href="https://dvc.org/doc/api-reference/get_url" target="_blank" rel="nofollow noopener noreferrer">Check out our API docs!</a></p>
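<p>A small sketch of the Python route, assuming DVC is installed and the current directory is a DVC project that tracks <code>data.csv</code>. The wrapper name <code>remote_url</code> is ours; the call it makes is <code>dvc.api.get_url</code>.</p>

```python
from typing import Optional


def remote_url(path: str, repo: str = ".", rev: Optional[str] = None) -> str:
    """Return the remote storage URL of a DVC-tracked file.

    Thin, illustrative wrapper around dvc.api.get_url.
    """
    # Imported lazily so this module still loads where DVC is not installed.
    import dvc.api

    return dvc.api.get_url(path, repo=repo, rev=rev)


# Example call (needs a DVC project with a configured remote):
#     print(remote_url("data.csv"))
```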
<h3 id="q-by-default-each-dvc-project-has-its-own-cache-in-the-project-repository-to-save-space-im-thinking-about-locally-creating-a-single-cache-folder-and-letting-multiple-project-repositories-point-there-will-this-work" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/736164141701791815" target="_blank" rel="nofollow noopener noreferrer">Q: By default, each DVC project has its own cache in the project repository. To save space, I'm thinking about locally creating a single cache folder and letting multiple project repositories point there. Will this work?</a><a href="#q-by-default-each-dvc-project-has-its-own-cache-in-the-project-repository-to-save-space-im-thinking-about-locally-creating-a-single-cache-folder-and-letting-multiple-project-repositories-point-there-will-this-work" aria-label="q by default each dvc project has its own cache in the project repository to save space im thinking about locally creating a single cache folder and letting multiple project repositories point there will this work permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes, we hear from many users who have created a
<a href="https://dvc.org/doc/user-guide/how-to/share-a-dvc-cache#configure-the-shared-cache" target="_blank" rel="nofollow noopener noreferrer">shared cache</a>.
Because of the way DVC uses content-addressable filenames, you won't encounter
issues like accidentally overwriting files from one project with another.</p>
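<p>To see why collisions aren't a concern, here is a toy model of a shared cache. It is a plain dictionary keyed by the MD5 of the content, not DVC's real cache implementation, and <code>add_to_cache</code> is a made-up name.</p>

```python
import hashlib

# Toy stand-in for a shared cache directory: hash -> content.
shared_cache = {}


def add_to_cache(content: bytes) -> str:
    """Store content under its MD5 hash and return that hash."""
    key = hashlib.md5(content).hexdigest()
    # Identical content always maps to the same key, so re-adding it
    # rewrites the same entry instead of clobbering a different one.
    shared_cache[key] = content
    return key


key_a = add_to_cache(b"project A data")
key_b = add_to_cache(b"project B data")
key_a_again = add_to_cache(b"project A data")

print(key_a != key_b)        # different content, different keys
print(key_a == key_a_again)  # same content, same key
print(len(shared_cache))     # only two entries are stored
```

<p>Two projects can even use the same file name: the name never enters the key, only the content does.</p>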
<p>A possible issue is that a shared cache will grant all teammates working on a
given project access to the data from all other projects using that cache. If
you have sensitive data, you can create different caches for projects involving
private and public data.</p>
<p>To learn more about setting your cache directory location,
<a href="https://dvc.org/doc/command-reference/cache/dir" target="_blank" rel="nofollow noopener noreferrer">see our docs</a>.</p>
<h2 id="cml-questions" style="position:relative;">CML questions<a href="#cml-questions" aria-label="cml questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="q-i-use-bitbucket-will-cml-work-for-me" style="position:relative;">Q: I use Bitbucket. Will CML work for me?<a href="#q-i-use-bitbucket-will-cml-work-for-me" aria-label="q i use bitbucket will cml work for me permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The first release of CML is compatible with GitHub and GitLab. We've seen
<a href="https://github.com/iterative/cml/issues/140" target="_blank" rel="nofollow noopener noreferrer">many requests for Bitbucket support</a>,
and we're actively investigating how to add this. Stay tuned.</p>
<h3 id="q-i-have-on-premise-gpus-can-cml-use-them-to-execute-pipelines" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/730070747388706867" target="_blank" rel="nofollow noopener noreferrer">Q: I have on-premise GPUs. Can CML use them to execute pipelines?</a><a href="#q-i-have-on-premise-gpus-can-cml-use-them-to-execute-pipelines" aria-label="q i have on premise gpus can cml use them to execute pipelines permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yep! You can use on-premise compute resources by configuring them as self-hosted
runners. See
<a href="https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners" target="_blank" rel="nofollow noopener noreferrer">GitHub</a>
and <a href="https://docs.gitlab.com/runner/" target="_blank" rel="nofollow noopener noreferrer">GitLab</a>'s official docs for more details
and setup instructions.</p>
<h3 id="q-im-building-a-workflow-that-deploys-a-gcp-compute-engine-instance-but-i-can-only-find-examples-with-aws-ec2-in-the-cml-docs-what-do-i-do" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/730688592787275806" target="_blank" rel="nofollow noopener noreferrer">Q: I'm building a workflow that deploys a GCP Compute Engine instance, but I can only find examples with AWS EC2 in the CML docs. What do I do?</a><a href="#q-im-building-a-workflow-that-deploys-a-gcp-compute-engine-instance-but-i-can-only-find-examples-with-aws-ec2-in-the-cml-docs-what-do-i-do" aria-label="q im building a workflow that deploys a gcp compute engine instance but i can only find examples with aws ec2 in the cml docs what do i do permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There is a slight difference in the way CML handles credentials for AWS and GCP,
and that means you'll have to modify your workflow file slightly. We've added an
example workflow for GCP to our
<a href="https://github.com/iterative/cml#allocating-cloud-resources-with-cml" target="_blank" rel="nofollow noopener noreferrer">project README</a>.</p>
<p>We've updated our
<a href="https://github.com/iterative/cml_cloud_case#using-a-different-cloud-service" target="_blank" rel="nofollow noopener noreferrer">cloud compute use case repository docs</a>
to cover a GCP example.</p>
<p>Note that for Azure, the workflow will be the same as for AWS. You'll only have
to change the arguments to <code>docker-machine</code>.</p>
<h3 id="q-i-dont-see-any-installation-instructions-in-the-cml-docs-am-i-missing-something" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/733659483758133269" target="_blank" rel="nofollow noopener noreferrer">Q: I don't see any installation instructions in the CML docs. Am I missing something?</a><a href="#q-i-dont-see-any-installation-instructions-in-the-cml-docs-am-i-missing-something" aria-label="q i dont see any installation instructions in the cml docs am i missing something permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Nope, there's no installation unless you wish to install CML in your own Docker
image. As long as you are using GitHub Actions or GitLab CI with the CML Docker
images, no other steps are needed.</p>
<p>If you're creating your own Docker image to be used in a GitHub Action or GitLab
CI pipeline, you can add CML to your image via npm:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token function">npm</span> i <span class="token parameter variable">-g</span> @dvcorg/cml</code></pre></div>
<h3 id="q-can-i-use-cml-with-mlflow" style="position:relative;"><a href="https://www.youtube.com/watch?v=9BgIDqAzfuA&lc=Ugw-VxQqAaqi9hmqB3t4AaABAg" target="_blank" rel="nofollow noopener noreferrer">Q: Can I use CML with MLFlow?</a><a href="#q-can-i-use-cml-with-mlflow" aria-label="q can i use cml with mlflow permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>CML is designed to integrate with lots of tools that ML teams are already
familiar with. For example, we set up a wrapper to use CML with Tensorboard, so
you get a link to your Tensorboard in a PR whenever your model is training
(<a href="https://github.com/iterative/cml_tensorboard_case/pull/3" target="_blank" rel="nofollow noopener noreferrer">check out the use case</a>).</p>
<p>While we haven't yet tried to create a use case with MLFlow in particular, we
think a similar approach could work. We could imagine using MLFlow for
hyperparameter searching, for example, and then checking in your best model with
Git to a CI system for evaluation in a production-like environment. CML could
help you orchestrate compute resources for model evaluation in your custom
environment, pulling the model and any validation data from cloud storage, and
reporting the results in a PR.</p>
<p>If this is something you're interested in, open an issue on our project
repository and tell us more about your project and needs. That lets us know it's
a priority in the community.</p>
<h3 id="q-are-there-more-tutorial-videos-coming" style="position:relative;">Q: Are there more tutorial videos coming?<a href="#q-are-there-more-tutorial-videos-coming" aria-label="q are there more tutorial videos coming permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes! We recently launched
<a href="https://dvc.org/blog/first-mlops-tutorial" target="_blank" rel="nofollow noopener noreferrer">our first CML tutorial video</a>, and a
lot of folks let us know they want more. We're aiming to release a new video
every week or so in the coming months. Topics will include:</p>
<ul>
<li>Using DVC to push and pull data from cloud storage to your CI system</li>
<li>Using CML with your on-premise hardware</li>
<li>Building a data dashboard in GitHub & GitLab for monitoring changes in dynamic
datasets</li>
<li>Provisioning cloud compute from your CI system</li>
<li>Creating a custom Docker container for testing models in a production-like
environment</li>
</ul>
<p>We really want to know what use cases, questions, and issues are most important
to you. This will help us make videos that are most relevant to the community!
If you have a suggestion or idea, no matter how small, we want to know. Leave a
<a href="https://youtu.be/9BgIDqAzfuA" target="_blank" rel="nofollow noopener noreferrer">comment on our videos</a>,
<a href="https://twitter.com/dvcorg" target="_blank" rel="nofollow noopener noreferrer">reach out on Twitter</a>, or
<a href="https://discord.gg/bzA6uY7" target="_blank" rel="nofollow noopener noreferrer">ping us in Discord</a>.</p>https://dvc.org/blog/shtab-completion-releasehttps://dvc.org/blog/shtab-completion-releaseMon, 27 Jul 2020 00:00:00 GMT<p>Command line tools are powerful. Things like <a href="https://en.wikipedia.org/wiki/Make_(software)" target="_blank" rel="nofollow noopener noreferrer"><code>make</code></a> have manual pages
spanning, well,
<a href="https://www.gnu.org/software/make/manual/make.html#Options-Summary" target="_blank" rel="nofollow noopener noreferrer">pages</a>,
while just the list of <a href="https://git-scm.com" target="_blank" rel="nofollow noopener noreferrer"><code>git</code></a> subcommands is longer than can fit on a standard
<code>80 x 24</code> terminal screen.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> <span class="token operator"><</span>TAB<span class="token operator">></span>
</span>add                 filter-branch       rebase
am                  format-patch        reflog
annotate            fsck                relink
...
describe            prco                unassume
--More--</code></pre></div>
<p>Notice the <code>--More--</code> at the bottom? That's the joy of pagination.</p>
<p>Notice the <code><TAB></code> at the top? That represents actually pressing the tab key.
Ah, the joy of shell tab completion.</p>
<p>Tab completion is an indispensable part of writing anything on the command line.
Personally, I can't imagine trying to <code>git co</code> (aliased to <code>git checkout</code>) a
branch without <code><TAB></code> to do the heavy lifting.
<a href="https://en.wikipedia.org/wiki/Letter_frequency" target="_blank" rel="nofollow noopener noreferrer">They say</a> "E" is the most
common vowel, and "T" the most common consonant. My keyboard use probably looks
more like this:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 500px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2eb5660d4bd9f2a134149c2995edb0ce/065c3/key-frequencies.png" alt="key frequencies" title="Yes, I use vim" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>My
key usage</em></p>
<p>Now, there's a tool called <code>dvc</code> which is like <a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">Git for data</a>.
It can be viewed as a cross-platform combination of <a href="https://git-scm.com" target="_blank" rel="nofollow noopener noreferrer"><code>git</code></a> and <a href="https://en.wikipedia.org/wiki/Make_(software)" target="_blank" rel="nofollow noopener noreferrer"><code>make</code></a>
designed for handling big data and multiple cloud storage repositories, as well
as tracking machine learning experiments. As you can imagine, supporting that
many buzzwords means it also has a large number of subcommands and options.</p>
<p><em>Every time a new feature is added, maintainers and contributors have to update
tab completion scripts for multiple supported shells. At best, it's a pain, and
at worst, error-prone. If you've worked on maintaining CLI applications, you'll
sympathise.</em></p>
<p>Surely the parser code you've written is informative enough to automate tab
completion? Surely you shouldn't have to maintain and synchronise separate tab
completion scripts?</p>
<p>Good news: <a href="https://github.com/iterative/shtab" target="_blank" rel="nofollow noopener noreferrer"><code>shtab</code></a> is a new tool which magically does all of this work.</p>
<p>Any Python CLI application using <a href="https://docs.python.org/library/argparse" target="_blank" rel="nofollow noopener noreferrer"><code>argparse</code></a>, <a href="https://pypi.org/project/docopt" target="_blank" rel="nofollow noopener noreferrer"><code>docopt</code></a>, or <a href="https://pypi.org/project/argopt" target="_blank" rel="nofollow noopener noreferrer"><code>argopt</code></a> can
have tab completion for free!</p>
<p>Simply hand your parser object to <code>shtab</code> (either via the CLI or the Python
API), and a tab completion script will be generated for your preferred shell.
It's as easy as:</p>
<ul>
<li>CLI: <code>shtab --shell=bash myprogram.main.parser</code>, or</li>
<li>Python API: <code>import shtab; print(shtab.complete(parser, shell="bash"))</code>.</li>
</ul>
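<p>To get a feel for why a parser object carries enough information for this, here is a rough sketch that walks an <code>argparse</code> parser and collects its option strings and choices. It is not how <code>shtab</code> is implemented: the helper <code>completion_words</code> and the <code>demo</code> parser are hypothetical, and the sketch relies on the private <code>_actions</code> attribute.</p>

```python
import argparse


def completion_words(parser: argparse.ArgumentParser) -> dict:
    """Map each option string to its allowed choices (empty list if free-form)."""
    words = {}
    # `_actions` is a private argparse attribute; using it is an assumption
    # of this sketch and may break in future Python versions.
    for action in parser._actions:
        choices = [str(c) for c in (action.choices or [])]
        for flag in action.option_strings:
            words[flag] = choices
    return words


parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("--shell", choices=["bash", "zsh"])
parser.add_argument("--verbose", action="store_true")

words = completion_words(parser)
print(words)
```

<p>A completion generator needs exactly this kind of data (flags, plus the values each flag accepts) to emit a script for a given shell.</p>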
<h3 id="argparse-example" style="position:relative;"><code>argparse</code> example<a href="#argparse-example" aria-label="argparse example permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Suppose you have some code in a module <code>hello.main</code>:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> argparse


<span class="token keyword">def</span> <span class="token function">get_main_parser</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    parser <span class="token operator">=</span> argparse<span class="token punctuation">.</span>ArgumentParser<span class="token punctuation">(</span>prog<span class="token operator">=</span><span class="token string">"hello"</span><span class="token punctuation">)</span>
    parser<span class="token punctuation">.</span>add_argument<span class="token punctuation">(</span>
        <span class="token string">"who"</span><span class="token punctuation">,</span> <span class="token builtin">help</span><span class="token operator">=</span><span class="token string">"good question"</span><span class="token punctuation">,</span> nargs<span class="token operator">=</span><span class="token string">"?"</span><span class="token punctuation">,</span> default<span class="token operator">=</span><span class="token string">"world"</span><span class="token punctuation">)</span>
    parser<span class="token punctuation">.</span>add_argument<span class="token punctuation">(</span>
        <span class="token string">"--what"</span><span class="token punctuation">,</span> <span class="token builtin">help</span><span class="token operator">=</span><span class="token string">"a better question"</span><span class="token punctuation">,</span> default<span class="token operator">=</span><span class="token string">"hello"</span><span class="token punctuation">,</span>
        choices<span class="token operator">=</span><span class="token punctuation">[</span><span class="token string">"hello"</span><span class="token punctuation">,</span> <span class="token string">"goodbye"</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
    <span class="token keyword">return</span> parser


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">"__main__"</span><span class="token punctuation">:</span>
    parser <span class="token operator">=</span> get_main_parser<span class="token punctuation">(</span><span class="token punctuation">)</span>
    args <span class="token operator">=</span> parser<span class="token punctuation">.</span>parse_args<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"{}, {}!"</span><span class="token punctuation">.</span><span class="token builtin">format</span><span class="token punctuation">(</span>args<span class="token punctuation">.</span>what<span class="token punctuation">,</span> args<span class="token punctuation">.</span>who<span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre></div>
<p>To get tab completion for <code>bash</code>, simply install <a href="https://github.com/iterative/shtab" target="_blank" rel="nofollow noopener noreferrer"><code>shtab</code></a> and then run:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">shtab <span class="token parameter variable">--shell</span><span class="token operator">=</span>bash hello.main.get_main_parser <span class="token punctuation">\</span>
<span class="token operator">|</span> <span class="token function">sudo</span> <span class="token function">tee</span> <span class="token string">"<span class="token environment constant">$BASH_COMPLETION_COMPAT_DIR</span>"</span>/hello <span class="token operator">></span>/dev/null</code></pre></div>
<p>Zsh user? Not a problem. Simply run:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">shtab <span class="token parameter variable">--shell</span><span class="token operator">=</span>zsh hello.main.get_main_parser <span class="token punctuation">\</span>
<span class="token operator">|</span> <span class="token function">sudo</span> <span class="token function">tee</span> /usr/local/share/zsh/site-functions/_hello <span class="token operator">></span>/dev/null
<span class="token comment"># note the underscore `_` prefix in the filename</span></code></pre></div>
<p>Handily, you can install <code>shtab</code>'s own completions by following the above
examples, replacing <code>hello</code> with <code>shtab</code>.</p>
<p><img src="https://dvc.org/2020-07-27/dvc-3857db37e1b5aeb81848451e82007f50.gif" alt=""><em><code>shtab</code>-driven <code>dvc</code> completion in
<code>bash</code> and <code>zsh</code></em></p>
<p>Using <code>shtab</code>, here's what
<a href="https://dvc.org/doc/install/completion" target="_blank" rel="nofollow noopener noreferrer"><code>dvc</code>'s completion</a> looks like when
installed:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">% dvc &lt;TAB&gt;
Completing dvc commands
add -- Track data files or directories with DVC.
cache -- Manage cache settings.
checkout -- Checkout data files from cache.
commit -- Save changed data to cache and update DVC-files.
completion -- Prints out shell tab completion scripts.
At Top: Hit TAB for more, or the character to insert</code></pre></div>
<p>All completion suggestions are guaranteed to stay in sync with the code! The maintainers of
<code>dvc</code> were very surprised to find no fewer than
<a href="https://github.com/iterative/dvc/commits/main/scripts/completion" target="_blank" rel="nofollow noopener noreferrer">84 commits</a>
touching their old completion scripts. Such churn is now a thing of the past!</p>
<p>You might notice one of the subcommands provided by <code>dvc</code> is
<a href="https://dvc.org/doc/install/completion" target="_blank" rel="nofollow noopener noreferrer"><code>completion</code></a>. Here's a quick example
of how to provide such convenience for users:</p>
<h3 id="integrating-library-example" style="position:relative;">Integrating library example<a href="#integrating-library-example" aria-label="integrating library example permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Feeling minimal? How about adding <code>import shtab</code> to your application itself for
a cleaner user interface? And let's use <a href="https://pypi.org/project/argopt" target="_blank" rel="nofollow noopener noreferrer"><code>argopt</code></a> to convert <a href="https://pypi.org/project/docopt" target="_blank" rel="nofollow noopener noreferrer"><code>docopt</code></a>'s neat
syntax to <code>argparse</code> while we're at it.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token triple-quoted-string string">"""Greetings and partings.

Usage:
  greeter [options] [&lt;you&gt;] [&lt;me&gt;]

Options:
  -g, --goodbye : Say "goodbye" (instead of "hello")
  -b, --print-bash-completion : Output a bash tab-completion script
  -z, --print-zsh-completion : Output a zsh tab-completion script

Arguments:
  &lt;you&gt; : Your name [default: Anon]
  &lt;me&gt; : My name [default: Casper]
"""</span>
<span class="token keyword">import</span> sys<span class="token punctuation">,</span> argopt<span class="token punctuation">,</span> shtab

parser <span class="token operator">=</span> argopt<span class="token punctuation">.</span>argopt<span class="token punctuation">(</span>__doc__<span class="token punctuation">)</span>
<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">"__main__"</span><span class="token punctuation">:</span>
    args <span class="token operator">=</span> parser<span class="token punctuation">.</span>parse_args<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token keyword">if</span> args<span class="token punctuation">.</span>print_bash_completion<span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>shtab<span class="token punctuation">.</span>complete<span class="token punctuation">(</span>parser<span class="token punctuation">,</span> shell<span class="token operator">=</span><span class="token string">"bash"</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
        sys<span class="token punctuation">.</span>exit<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span>
    <span class="token keyword">if</span> args<span class="token punctuation">.</span>print_zsh_completion<span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>shtab<span class="token punctuation">.</span>complete<span class="token punctuation">(</span>parser<span class="token punctuation">,</span> shell<span class="token operator">=</span><span class="token string">"zsh"</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
        sys<span class="token punctuation">.</span>exit<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span>

    msg <span class="token operator">=</span> <span class="token string">"k thx bai!"</span> <span class="token keyword">if</span> args<span class="token punctuation">.</span>goodbye <span class="token keyword">else</span> <span class="token string">"hai!"</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"{} says '{}' to {}"</span><span class="token punctuation">.</span><span class="token builtin">format</span><span class="token punctuation">(</span>args<span class="token punctuation">.</span>me<span class="token punctuation">,</span> msg<span class="token punctuation">,</span> args<span class="token punctuation">.</span>you<span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre></div>
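<p>If you'd rather not depend on <code>argopt</code>, roughly the same greeter can be sketched
with plain <code>argparse</code>. Treat this as an illustration only: the function names
<code>build_parser</code> and <code>run</code> and the single <code>--print-completion</code> flag are
assumptions made for this sketch, and the only <code>shtab</code> call used is the
<code>shtab.complete(parser, shell=...)</code> call shown above. The import is optional so the
greeter still works when <code>shtab</code> isn't installed.</p>

```python
import argparse

try:
    import shtab  # optional: only needed to print completion scripts
except ImportError:
    shtab = None


def build_parser():
    """Build the greeter parser (hypothetical argparse-only variant)."""
    parser = argparse.ArgumentParser(
        prog="greeter", description="Greetings and partings.")
    parser.add_argument("you", nargs="?", default="Anon", help="Your name")
    parser.add_argument("me", nargs="?", default="Casper", help="My name")
    parser.add_argument("-g", "--goodbye", action="store_true",
                        help='Say "goodbye" (instead of "hello")')
    # Assumed flag name for this sketch: one option instead of -b / -z.
    parser.add_argument("--print-completion", choices=["bash", "zsh"],
                        help="Output a tab-completion script for this shell")
    return parser


def run(argv):
    """Return the text the CLI would print for the given argument list."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.print_completion:
        if shtab is None:
            parser.error("install shtab to generate completion scripts")
        return shtab.complete(parser, shell=args.print_completion)
    msg = "k thx bai!" if args.goodbye else "hai!"
    return "{} says '{}' to {}".format(args.me, msg, args.you)


print(run(["Ada", "--goodbye"]))  # prints: Casper says 'k thx bai!' to Ada
```

<p>In a real script you would call <code>run(sys.argv[1:])</code> from an
<code>if __name__ == "__main__":</code> block and print the result.</p>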
<h3 id="try-it-out" style="position:relative;">Try it out<a href="#try-it-out" aria-label="try it out permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There are many more options and features. The <a href="https://github.com/iterative/shtab" target="_blank" rel="nofollow noopener noreferrer">documentation</a> includes
examples of working with custom file completions and providing a <code>completion</code>
subcommand when integrating more tightly with existing applications.</p>
<p>Try it out with <code>pip install -U shtab</code> or <code>conda install -c conda-forge shtab</code>!</p>
<p>Is it worth the time?</p>
<p><img src="https://imgs.xkcd.com/comics/is_it_worth_the_time.png" alt=""><em>It's worth it
<a href="https://xkcd.com/1205" target="_blank" rel="nofollow noopener noreferrer">xkcd#1205</a></em></p>
<p><a href="https://github.com/iterative/shtab" target="_blank" rel="nofollow noopener noreferrer"><code>shtab</code></a> would be on the second row, far left (maybe even off grid). It's worth
spending days to get right yet only takes seconds to install.</p>https://dvc.org/blog/first-mlops-tutorialhttps://dvc.org/blog/first-mlops-tutorialFri, 24 Jul 2020 00:00:00 GMT<p>Earlier this month, we launched <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a>, our latest open-source
project in the MLOps space. We think it's a step towards establishing powerful
DevOps practices (like continuous integration) as a regular fixture of machine
learning and data science projects. But there are plenty of challenges ahead,
and a big one is <em>literacy</em>.</p>
<p>So many data scientists, like developers, are self-taught. Data science degrees
have only recently emerged on the scene, which means if you polled a handful of
senior-level data scientists, there'd almost certainly be no universal training
or certificate among them. Moreover, there's still no widespread agreement about
what it takes to be a data scientist: is it an engineering role with a little
bit of TensorFlow sprinkled on top? A title for statisticians who can code?
We're not expecting an easy resolution to these existential questions anytime
soon.</p>
<p>In the meantime, we're starting a video series to help data scientists curious
about DevOps (and developers and engineers curious about data science!) get
started. Through hands-on coding examples and use cases, we want to give data
science practitioners the fundamentals to explore, use, and influence MLOps.</p>
<p>The first video in this series uses a lightweight and fairly popular data
science problem (building a model to predict wine quality ratings) as a
playground to introduce continuous integration.</p>
<p>The tutorial covers:</p>
<ul>
<li>Using Git-flow in a data science project (making a feature branch and pull
request)</li>
<li>Creating your first GitHub Action to train and evaluate a model</li>
<li>Using CML to generate visual reports in your pull request summarizing model
performance</li>
</ul>
<p>It's now up on YouTube!</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/9BgIDqAzfuA?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p><a href="https://github.com/andronovhopf/wine" target="_blank" rel="nofollow noopener noreferrer">Code for the project is available online</a>
so you can follow along! We also recommend checking out the
<a href="https://github.com/iterative/cml" target="_blank" rel="nofollow noopener noreferrer">CML docs</a> for more details, tutorials, and
use cases.</p>
<p>If you have questions, the best way to get in touch is by leaving a comment on
the blog or video, or in our <a href="https://discord.gg/bzA6uY7" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>. And
we're especially interested to hear what use cases you'd like to see covered in
future videos: tell us about your data science project and how you could imagine
using continuous integration, and we might be able to create a video!</p>https://dvc.org/blog/devops-for-data-scientistshttps://dvc.org/blog/devops-for-data-scientistsThu, 16 Jul 2020 00:00:00 GMT<p>With the rapid evolution of machine learning (ML) in the last few years, it’s
become
<a href="https://towardsdatascience.com/deep-learning-isnt-hard-anymore-26db0d4749d7" target="_blank" rel="nofollow noopener noreferrer">trivially easy to begin ML experiments</a>.
Thanks to libraries like <a href="https://scikit-learn.org/stable/" target="_blank" rel="nofollow noopener noreferrer">scikit-learn</a> and
<a href="https://github.com/keras-team/keras" target="_blank" rel="nofollow noopener noreferrer">Keras</a>, you can make models with a few
lines of code.</p>
<p>But it’s harder than ever to turn data science projects into meaningful
applications, like a model that informs team decisions or becomes part of a
product. The typical ML project involves
<a href="https://ieeexplore.ieee.org/abstract/document/8804457" target="_blank" rel="nofollow noopener noreferrer">so many distinct skill sets</a>
that it’s challenging, if not outright impossible, for any one person to master
them all — so hard, the rare data scientist who can also develop quality
software and play engineer is called a unicorn!</p>
<p>As the field matures, a lot of jobs are going to require a mix of software,
engineering, and mathematical chops. Some say
<a href="https://www.anaconda.com/state-of-data-science-2020?utm_medium=press&utm_source=anaconda&utm_campaign=sods-2020&utm_content=report" target="_blank" rel="nofollow noopener noreferrer">they</a>
<a href="http://veekaybee.github.io/2019/02/13/data-science-is-different/" target="_blank" rel="nofollow noopener noreferrer">already</a>
<a href="https://tech.trivago.com/2018/12/03/teardown-rebuild-migrating-from-hive-to-pyspark/" target="_blank" rel="nofollow noopener noreferrer">do</a>.</p>
<p>To quote the unparalleled data scientist/engineer/critical observer Vicki Boykis
in her blog
<a href="http://veekaybee.github.io/2019/02/13/data-science-is-different/" target="_blank" rel="nofollow noopener noreferrer">Data science is different now</a>:</p>
<blockquote>
<p>What is becoming clear is that, in the late stage of the hype cycle, data
science is asymptotically moving closer to engineering, and the
<a href="https://www.youtube.com/watch?v=frQeK8xo9Ls" target="_blank" rel="nofollow noopener noreferrer">skills that data scientists need</a>
moving forward are less visualization and statistics-based, and
<a href="https://tech.trivago.com/2018/12/03/teardown-rebuild-migrating-from-hive-to-pyspark/" target="_blank" rel="nofollow noopener noreferrer">more in line with traditional computer science curricula</a>.</p>
</blockquote>
<h2 id="why-data-scientists-need-to-know-about-devops" style="position:relative;">Why data scientists need to know about DevOps<a href="#why-data-scientists-need-to-know-about-devops" aria-label="why data scientists need to know about devops permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>So which of the many, many engineering and software skills should data
scientists learn? My money is on DevOps. DevOps, a portmanteau of development
and operations, was officially born in 2009
<a href="https://en.wikipedia.org/wiki/DevOps#History" target="_blank" rel="nofollow noopener noreferrer">at a Belgian conference</a>. The
meeting was convened as a response to tensions between two facets of tech
organizations that historically experienced deep divisions. Software developers
needed to move fast and experiment often, while Operations teams prioritized
stability and availability of services (these are the people who keep servers
running day in and day out). Their goals were not only opposing, they were
competing.</p>
<p>That sounds awfully reminiscent of today’s data science. Data scientists create
value by experiments: new ways of modeling, combining, and transforming data.
Meanwhile, the organizations that employ data scientists are incentivized for
stability.</p>
<p>The consequences of this division are profound: in the
<a href="https://www.globenewswire.com/news-release/2020/06/30/2055578/0/en/Anaconda-Releases-2020-State-of-Data-Science-Survey-Results.html" target="_blank" rel="nofollow noopener noreferrer">latest Anaconda “State of Data Science” report</a>,
“fewer than half (48%) of respondents feel they can demonstrate the impact of
data science” on their organization. By some estimates, the vast majority of
<a href="https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/" target="_blank" rel="nofollow noopener noreferrer">models created by data scientists end up stuck on a shelf</a>.
We don’t yet have strong practices for passing models between the teams that
create them and the teams that deploy them. Data scientists and the developers
and engineers who implement their work have entirely different tools,
constraints, and skill sets.</p>
<p>DevOps emerged to combat this sort of deadlock in software, back when it was
developers vs. operations. And it was tremendously successful:
<a href="http://engineering.microsoft.com/devops/" target="_blank" rel="nofollow noopener noreferrer">many</a>
<a href="https://insights.sei.cmu.edu/devops/2015/02/devops-case-study-amazon-aws.html" target="_blank" rel="nofollow noopener noreferrer">teams</a>
have gone from deploying new code every few months to several times a day. Now
that we have machine learning vs. operations, it’s time to think about MLOps —
principles from DevOps that work for data science.</p>
<h2 id="introducing-continuous-integration" style="position:relative;">Introducing Continuous Integration<a href="#introducing-continuous-integration" aria-label="introducing continuous integration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>DevOps is both a philosophy and a set of practices, including:</p>
<ol>
<li>
<p>Automate everything you can</p>
</li>
<li>
<p>Get feedback on new ideas fast</p>
</li>
<li>
<p>Reduce manual handoffs in your workflow</p>
</li>
</ol>
<p>In a typical data science project, we can see some applications:</p>
<ol>
<li>
<p><strong>Automate everything you can.</strong> Automate parts of your data processing,
model training, and model testing that are repetitive and predictable.</p>
</li>
<li>
<p><strong>Get feedback on new ideas fast.</strong> When your data, code, or software
environment changes, test it immediately in a production-like environment
(meaning, a machine with the dependencies and constraints you anticipate
having in production).</p>
</li>
<li>
<p><strong>Reduce manual handoffs in your workflow.</strong> Find opportunities for data
scientists to test their own models as much as possible. Don’t wait until a
developer is available to see how the model will behave in a production-like
environment.</p>
</li>
</ol>
<p>The standard DevOps approach for accomplishing these goals is a method called
continuous integration (CI).</p>
<p>The gist is that when you change a project’s source code (usually, changes are
registered via git commits), your software is automatically built and tested.
Every action triggers feedback. CI is often used with
<a href="https://nvie.com/posts/a-successful-git-branching-model/" target="_blank" rel="nofollow noopener noreferrer">Git-flow</a>, a
branching workflow in which new features are built on Git branches (need a
Git refresher?
<a href="https://towardsdatascience.com/why-git-and-how-to-use-git-as-a-data-scientist-4fa2d3bdc197" target="_blank" rel="nofollow noopener noreferrer">Try this</a>).
When a feature branch passes the automated tests, it becomes a candidate to be
merged into the master branch.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9686e1522b8cfdc441dd2fff2c34db15/39600/basic_ci_system.png" alt="basic ci system" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Here's what continuous
integration looks like in software development.</em></p>
<p>With this setup, we have automation — code changes trigger an automatic build
followed by testing. We have fast feedback, because we get test results back
quickly, so the developer can keep iterating on their code. And because all this
happens automatically, you don’t need to wait for anyone else to get feedback —
one less handoff!</p>
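<p>To make the "automatic build followed by testing" step concrete, here is a minimal
sketch of the kind of pass/fail check a CI system might run on every commit. The
<code>normalize</code> function and its test case are hypothetical stand-ins for your own
project code; only Python's standard <code>unittest</code> module is used.</p>

```python
import unittest


def normalize(values):
    """Min-max scale a list of numbers into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


class NormalizeTest(unittest.TestCase):
    def test_scales_to_unit_range(self):
        self.assertEqual(normalize([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input(self):
        self.assertEqual(normalize([3, 3, 3]), [0.0, 0.0, 0.0])


# Run the suite programmatically; a CI job would fail if any test fails.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

<p>If a later commit breaks <code>normalize</code>, <code>result.wasSuccessful()</code> becomes
<code>False</code>, which is exactly the signal a CI system reports back to the developer.</p>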
<p><em>So why don’t we use continuous integration already in ML?</em> Some reasons are
cultural, like a low crossover between data science and software engineering
communities. Others are technical: for example, to understand your model’s
performance, you need to look at metrics like accuracy, specificity, and
sensitivity. You might be assisted by data visualizations, like a confusion
matrix or loss plot. So pass/fail tests won’t cut it for feedback. Understanding
if a model is improved requires some domain knowledge about the problem at hand,
so test results need to be reported in an efficient and human-interpretable way.</p>
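<p>As a rough illustration of why a single pass/fail flag isn't enough, the sketch below
computes accuracy, sensitivity, and specificity from a binary confusion matrix using
only the standard library. The function name <code>binary_metrics</code> and the example
labels are made up for this illustration; in practice you would likely use a library
such as scikit-learn.</p>

```python
def binary_metrics(y_true, y_pred):
    """Compute basic classification metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Sensitivity: share of actual positives that were found.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Specificity: share of actual negatives that were found.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # Rows are actual classes (0, 1); columns are predicted classes.
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


# Made-up labels, purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

<p>Two models with the same accuracy can have very different sensitivity and
specificity, which is why a human-readable report is more useful here than a green
check mark.</p>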
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c6eab1d9783382564176cf970c5956b1/39600/ci_for_data_system.png" alt="ci for data system" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Here's what continuous
integration might look like in a machine learning project. Inspected by Data
Science Doggy.</em></p>
<h2 id="how-do-ci-systems-work" style="position:relative;">How do CI systems work?<a href="#how-do-ci-systems-work" aria-label="how do ci systems work permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Now we’ll get even more practical. Let’s take a look at how a typical CI system
works. Luckily for learners, the barrier has never been lower thanks to tools
like GitHub Actions and GitLab CI, which have clear graphical interfaces and
excellent docs geared toward first-time users. Since GitHub Actions is completely
free for public projects, we’ll use it for this example. It works like this:</p>
<ol>
<li>You create a GitHub repository. Inside it, you create a directory called
<code>.github/workflows</code> and place a special <code>.yaml</code> file there with a
script you want to run, such as:</li>
</ol>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">python</span> train.py</span></code></pre></div>
<ol start="2">
<li>You change the files in your project repository somehow and Git commit the
change. Then, push to your GitHub repository.</li>
</ol>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token comment"># Create a new git branch for experimenting</span>
<span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> <span class="token parameter variable">-b</span> <span class="token string">"experiment"</span>
</span><span class="token line"><span class="token input">$ </span><span class="token command">edit</span> train.py
</span>
<span class="token comment"># git add, commit, and push your changes</span>
<span class="token line"><span class="token input">$ </span><span class="token git">git add</span> <span class="token builtin class-name">.</span> <span class="token operator">&&</span> git commit <span class="token parameter variable">-m</span> <span class="token string">"Normalized features"</span>
</span><span class="token line"><span class="token input">$ </span><span class="token git">git push</span> origin experiment</span></code></pre></div>
<ol start="3">
<li>
<p>As soon as GitHub detects the push, GitHub deploys one of their computers to
run the commands in your <code>.yaml</code> file.</p>
</li>
<li>
<p>GitHub returns a notification indicating whether the commands ran successfully.</p>
</li>
</ol>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/990319ad51b933539c46b3cb7622541d/39600/run_notification.png" alt="run notification" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Find this in the Actions
tab of your GitHub repository.</em></p>
<p>That’s it! What’s really neat here is that you’re using GitHub’s computers to
run your code. All you have to do is update your code and push the change to
your repository, and the workflow happens automatically.</p>
<p>Back to that special <code>.yaml</code> file I mentioned in Step 1: let’s take a quick look
at one. It can have any name you like, as long as the file extension is <code>.yaml</code>
or <code>.yml</code> and it’s stored in the directory <code>.github/workflows</code>. Here’s one:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># .github/workflows/ci.yaml</span>
<span class="token key atrule">name</span><span class="token punctuation">:</span> train<span class="token punctuation">-</span>my<span class="token punctuation">-</span>model
<span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>push<span class="token punctuation">]</span>
<span class="token key atrule">jobs</span><span class="token punctuation">:</span>
  <span class="token key atrule">run</span><span class="token punctuation">:</span>
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>ubuntu<span class="token punctuation">-</span>latest<span class="token punctuation">]</span>
    <span class="token key atrule">steps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> training
        <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
          pip install -r requirements.txt
          python train.py</span></code></pre></div>
<p>There’s a lot going on, but most of it is the same from Action to Action, so you
can pretty much copy and paste this standard GitHub Actions template and fill
in your workflow in the <code>run</code> field.</p>
<p>If this file is in your project repo, whenever GitHub detects a change to your
code (registered via a push), GitHub Actions will deploy an Ubuntu runner and
attempt to execute your commands to install requirements and run a Python
script. Be aware that you have to have the files required for your workflow —
here, <code>requirements.txt</code> and <code>train.py</code> — in your project repo!</p>
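<p>For a sense of what the runner actually executes, here is a deliberately tiny,
hypothetical stand-in for <code>train.py</code>. It is not the script from the wine-quality
project: the toy data, the threshold "model", and the helper names are all assumptions
made for this sketch. It "trains" on a handful of numbers and writes its accuracy to
<code>metrics.txt</code>, the kind of artifact a CI report can later pick up.</p>

```python
import tempfile
from pathlib import Path

# Hypothetical toy data: (feature value, label). A real script loads a dataset.
TRAIN = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
TEST = [(0.2, 0), (0.7, 1), (0.55, 0), (0.8, 1)]


def fit_threshold(rows):
    """'Train' by placing a threshold midway between the two class means."""
    zeros = [x for x, y in rows if y == 0]
    ones = [x for x, y in rows if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, rows):
    """Fraction of rows whose thresholded prediction matches the label."""
    correct = sum(1 for x, y in rows if int(x > threshold) == y)
    return correct / len(rows)


def main(out_dir="."):
    """Train, evaluate, and write metrics.txt into out_dir."""
    threshold = fit_threshold(TRAIN)
    acc = accuracy(threshold, TEST)
    (Path(out_dir) / "metrics.txt").write_text("Accuracy: {:.2f}\n".format(acc))
    return acc


# Demo: write into a temporary directory so nothing in the repo is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_accuracy = main(tmp)
    demo_text = (Path(tmp) / "metrics.txt").read_text()
print(demo_text, end="")  # prints: Accuracy: 0.75
```

<p>If the script raises an exception, Python exits with a non-zero status and the
workflow run is marked as failed.</p>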
<h2 id="get-better-feedback" style="position:relative;">Get better feedback<a href="#get-better-feedback" aria-label="get better feedback permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>As we alluded to earlier, automatic training is pretty cool and all, but it’s
important to have the results in a format that’s easy to understand. Currently,
GitHub Actions gives you access to the runner’s logs, which are plain text.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f2f16dc29109a1fecb2b327d4738b8a6/39600/github_actions_log.png" alt="github actions log" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>An example printout
from a GitHub Actions log.</em></p>
<p>But understanding your model’s performance is tricky. Models and data are high
dimensional and often behave nonlinearly — two things that are especially hard
to understand without pictures!</p>
<p>I can show you one approach for putting data viz in the CI loop. For the last
few months, my team at Iterative.ai has been working on a toolkit to help use
GitHub Actions and GitLab CI for machine learning projects. It’s called
<a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">Continuous Machine Learning</a> (CML for short), and it’s open
source and free.</p>
<p>Working from the basic idea of “Let’s use GitHub Actions to train ML models,”
we’ve built some functions to give more detailed reports than a pass/fail
notification. CML helps you put images and tables in the reports, like this
confusion matrix generated by scikit-learn:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/8e4cf67da97031a136fc7af36fee9520/39600/cml_basic_report.png" alt="cml basic report" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>This report appears when
you make a Pull Request in GitHub!</em></p>
<p>To make this report, our GitHub Action executed a Python model training script,
and then used CML functions to write our model accuracy and confusion matrix to
a markdown document. Then CML passed the markdown document to GitHub.</p>
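<p>The markdown document itself is nothing exotic. As a rough sketch of the idea, and
not of CML's own implementation, the snippet below builds a small report string
containing an accuracy line and a confusion matrix rendered as a markdown table. The
function name <code>render_report</code>, the class labels, and the numbers are hypothetical.</p>

```python
def render_report(accuracy, matrix, labels):
    """Build a markdown report with an accuracy line and a confusion matrix.

    matrix[i][j] counts samples whose actual class is labels[i] and whose
    predicted class is labels[j].
    """
    lines = [
        "## Model report",
        "",
        "Accuracy: {:.2f}".format(accuracy),
        "",
        "| actual vs. predicted | " + " | ".join(labels) + " |",
        "|" + " --- |" * (len(labels) + 1),
    ]
    for label, row in zip(labels, matrix):
        cells = " | ".join(str(count) for count in row)
        lines.append("| {} | {} |".format(label, cells))
    return "\n".join(lines) + "\n"


# Made-up numbers, purely for illustration.
report = render_report(0.75, [[3, 1], [1, 3]], ["low", "high"])
print(report)
```

<p>Writing this string to <code>report.md</code> gives you a file of the same kind that the
workflow hands to CML for display in the pull request.</p>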
<p>Our revised <code>.yaml</code> file contains the following workflow:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">name</span><span class="token punctuation">:</span> train<span class="token punctuation">-</span>my<span class="token punctuation">-</span>model
<span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>push<span class="token punctuation">]</span>
<span class="token key atrule">jobs</span><span class="token punctuation">:</span>
  <span class="token key atrule">run</span><span class="token punctuation">:</span>
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>ubuntu<span class="token punctuation">-</span>latest<span class="token punctuation">]</span>
    <span class="token key atrule">container</span><span class="token punctuation">:</span> iterativeai/cml<span class="token punctuation">:</span>0<span class="token punctuation">-</span>dvc2<span class="token punctuation">-</span>base1
    <span class="token key atrule">steps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> training
        <span class="token key atrule">env</span><span class="token punctuation">:</span>
          <span class="token key atrule">repo_token</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.GITHUB_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span>
        <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
          # train.py outputs metrics.txt and plot.png
          pip3 install -r requirements.txt
          python train.py</span>

          <span class="token comment"># copy the contents of metrics.txt to our markdown report</span>
          cat metrics.txt <span class="token punctuation">></span><span class="token punctuation">></span> report.md
          <span class="token comment"># add our confusion matrix to report.md</span>
          cml publish plot.png <span class="token punctuation">-</span><span class="token punctuation">-</span>md <span class="token punctuation">></span><span class="token punctuation">></span> report.md
          <span class="token comment"># send the report to GitHub for display</span>
          cml send<span class="token punctuation">-</span>comment report.md</code></pre></div>
<p>You can see the entire
<a href="https://github.com/iterative/cml_base_case" target="_blank" rel="nofollow noopener noreferrer">project repository here</a>. Note that
our .yaml now contains a few more configuration details, like a special Docker
container and an environment variable, plus some new code to run. The
container and environment variable details are standard in every CML project,
not something the user needs to manipulate, so focus on the code!</p>
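<p>To make the reporting step a little more concrete, here is a minimal Python
sketch of the kind of text a training script could hand to the report. It is
not the code from the project repository: the labels and the helper names
(<code>confusion_counts</code>, <code>accuracy</code>, <code>to_markdown</code>, <code>build_report</code>) are
hypothetical, it uses only the standard library, and it renders the confusion
matrix as a markdown table instead of the <code>plot.png</code> image used above.</p>

```python
from collections import Counter


def confusion_counts(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def to_markdown(matrix, labels):
    """Render the confusion matrix as a markdown table."""
    header = "| true label | " + " | ".join("predicted " + l for l in labels) + " |"
    divider = "|" + " --- |" * (len(labels) + 1)
    rows = [
        "| " + label + " | " + " | ".join(str(n) for n in row) + " |"
        for label, row in zip(labels, matrix)
    ]
    return "\n".join([header, divider] + rows)


def build_report(y_true, y_pred, labels):
    """Combine the accuracy line and the confusion matrix into one markdown string."""
    table = to_markdown(confusion_counts(y_true, y_pred, labels), labels)
    return f"Accuracy: {accuracy(y_true, y_pred):.3f}\n\n{table}\n"


if __name__ == "__main__":
    # Made-up labels, purely for illustration.
    truth = ["cat", "dog", "cat", "dog"]
    preds = ["cat", "cat", "cat", "dog"]
    print(build_report(truth, preds, ["cat", "dog"]))
```

<p>In a workflow like the one above, output of this kind could be written to
<code>metrics.txt</code> and appended to <code>report.md</code> before the report is sent.</p>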
<p>With the addition of these CML functions to the workflow, we’ve created a more
complete feedback loop in our CI system:</p>
<ol>
<li>
<p>Make a Git branch and change your code on that branch.</p>
</li>
<li>
<p>Automatically train the model and produce metrics (accuracy) and a visualization
(confusion matrix).</p>
</li>
<li>
<p>Embed those results in a visual report in your Pull Request.</p>
</li>
</ol>
<p>Now, when you and your teammates are deciding if your changes have a positive
effect on your modeling goals, you have a dashboard of sorts to review. Plus,
this report is linked by Git to your exact project version (data and code) AND
the runner used for training AND the logs from that run. Very thorough! No more
graphs floating around your workspace that have long ago lost any connection to
your code!</p>
<p>So that’s the basic idea of CI in a data science project. To be clear, this
example is among the simplest ways to work with CI. In real life, you’ll likely
encounter considerably more complex scenarios. CML also has features to help you
use large datasets stored outside your GitHub repository (using DVC) and train
on cloud instances, instead of the default GitHub Actions runners. That means
you can use GPUs and other specialized setups.</p>
<p>For example, I made a project using GitHub Actions to deploy an
<a href="https://github.com/iterative/cml_cloud_case" target="_blank" rel="nofollow noopener noreferrer">EC2 GPU and then train a neural style transfer model</a>.
Here’s my CML report:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/cd248dcfa2d85a511c3c095948ed83c9/39600/cloud_report.png" alt="cloud report" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Training in the cloud!
Weeeeeee!</em></p>
<p>You can also use your own Docker containers, so you can closely emulate the
environment of a model in production. I’ll be blogging more about these advanced
use cases in the future.</p>
<h2 id="final-thoughts-on-ci-for-ml" style="position:relative;">Final thoughts on CI for ML<a href="#final-thoughts-on-ci-for-ml" aria-label="final thoughts on ci for ml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>To summarize what we’ve said so far:</p>
<p><strong>DevOps is not a specific technology, but a philosophy and a set of principles
and practices for fundamentally restructuring the process of creating
software.</strong> It’s effective because it <strong>addresses systemic bottlenecks</strong> in how
teams work and experiment with new code.</p>
<p>As data science matures in the coming years, people who understand how to apply
DevOps principles to their machine learning projects will be a valuable
commodity — both in terms of salary and their organizational impact. Continuous
integration is a staple of DevOps and one of the most effective known methods
for building a culture with reliable automation, fast testing, and autonomy for
teams.</p>
<p>CI can be implemented with systems like
<a href="https://github.com/features/actions" target="_blank" rel="nofollow noopener noreferrer">GitHub Actions</a> or
<a href="https://about.gitlab.com/stages-devops-lifecycle/continuous-integration/" target="_blank" rel="nofollow noopener noreferrer">GitLab CI</a>,
and you can use these services to build automatic model training systems. The
benefits are numerous:</p>
<ol>
<li>
<p>Your code, data, models, and training infrastructure (hardware and software
environment) are Git versioned.</p>
</li>
<li>
<p>You’re automating work, testing frequently and getting fast feedback (with
visual reports if you use CML). In the long run, this will almost certainly
speed up your project’s development.</p>
</li>
<li>
<p>CI systems make your work visible to everyone on your team. No one has to
search very hard to find the code, data, and model from your best run.</p>
</li>
</ol>
<p>And I promise, once you get into the groove, it is incredibly fun to have your
model training, recording, and reporting automatically kicked off by a single
git commit.</p>
<p>You will feel so cool.</p>
<p><img src="https://media.giphy.com/media/26AHG5KGFxSkUWw1i/giphy.gif" alt="Pixel Illustration GIF by Walter Newton"></p>
<h3 id="further-reading" style="position:relative;">Further reading<a href="#further-reading" aria-label="further reading permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li>
<p><a href="https://www.martinfowler.com/articles/continuousIntegration.html" target="_blank" rel="nofollow noopener noreferrer">Continuous Integration</a>,
the seminal Martin Fowler blog on the subject</p>
</li>
<li>
<p><a href="https://martinfowler.com/articles/cd4ml.html" target="_blank" rel="nofollow noopener noreferrer">Continuous Delivery for Machine Learning</a>,
another excellent blog on Martin Fowler’s site about building a continuous
integration & continuous delivery system for ML</p>
</li>
<li>
<p><a href="https://www.amazon.com/DevOps-Handbook-Second-World-Class-Organizations/dp/B09L56CT6N" target="_blank" rel="nofollow noopener noreferrer">The DevOps Handbook</a>,
a beloved guide that is recommended for nearly any organization (ML, software,
or not)</p>
</li>
</ul>
<p><em><strong>Note:</strong> This article has been cross-posted on Medium.</em></p>
https://dvc.org/blog/july-20-dvc-heartbeat
Fri, 10 Jul 2020 00:00:00 GMT
<p>Welcome to the July Heartbeat, our monthly roundup of <a href="#news">new releases</a>,
<a href="#community-activity">talks</a>, <a href="#good-reads">great articles</a>, and
<a href="#coming-up-soon">upcoming events</a> in the DVC community.</p>
<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="dvc-10-release" style="position:relative;">DVC 1.0 release<a href="#dvc-10-release" aria-label="dvc 10 release permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>On June 22, DVC entered a new era: the
<a href="https://dvc.org/blog/dvc-1-0-release" target="_blank" rel="nofollow noopener noreferrer">official release of version 1.0</a>. After
several weeks of bug-catching with our pre-release, the team has released DVC 1.0
to the public! Now when you
<a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">install DVC through your package manager of choice</a>,
you'll get the latest version. Welcome to the future.</p>
<p>To recap, DVC 1.0 has some big new features like:</p>
<ul>
<li>Plots powered by Vega-Lite so you can compare metrics across commits</li>
<li>New and easier pipeline configuration files: edit your DVC pipeline like a
text file!</li>
<li>Optimizations for data transfer speed</li>
</ul>
<p>Read all the <a href="https://dvc.org/blog/dvc-1-0-release" target="_blank" rel="nofollow noopener noreferrer">release notes</a> for more, and
stop by our <a href="https://discordapp.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a> if you need support
migrating (don't worry, 1.0 is backwards compatible).</p>
<h3 id="virtual-meetup" style="position:relative;">Virtual meetup!<a href="#virtual-meetup" aria-label="virtual meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In May, we had our <a href="https://dvc.org/blog/may-20-dvc-heartbeat">first-ever virtual meetup</a>. We
had amazing talks from <a href="https://twitter.com/DeanPlbn" target="_blank" rel="nofollow noopener noreferrer">Dean Pleban</a> and
<a href="https://github.com/ehutt" target="_blank" rel="nofollow noopener noreferrer">Elizabeth Hutton</a>, plus time for Q&A with the DVC
team. You can
<a href="https://www.youtube.com/watch?v=19GMtrFykSU&list=PLVeJCYrrCemiOc1SS_PIB3Tb3HX0Aqw3j" target="_blank" rel="nofollow noopener noreferrer">watch the recording</a>
if you missed it!</p>
<p>On Thursday, July 30, we're hosting our second meetup! Ambassador
<a href="http://mribeirodantas.me/" target="_blank" rel="nofollow noopener noreferrer">Marcel Ribeiro-Dantas</a> is hosting once again. We'll
have short talks about causal modeling and CI/CD, plus lots of time for chatting
and catching up. Please RSVP!</p>
<blockquote class="embedly-card"><h4><a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/271844501/">July DVC Meetup: Data Science & DevOps!</a></h4><p>This meetup will be hosted by DVC Ambassador Marcel! AGENDA: We have two 10-minute talks on the agenda: "Causal Modeling with DVC" (Marcel) and "Continuous integration for ML case studies" (Elle). Following the talks, we'll have Q&A with the DVC team and time for community discussion.</p></blockquote>
<script async src="//cdn.embedly.com/widgets/platform.js" charset="UTF-8"></script>
<h3 id="dvc-is-in-the-top-20-fastest-growing-open-source-startups" style="position:relative;">DVC is in the top 20 fastest-growing open source startups<a href="#dvc-is-in-the-top-20-fastest-growing-open-source-startups" aria-label="dvc is in the top 20 fastest growing open source startups permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Konstantin Vinogradov at <a href="https://runacap.com/" target="_blank" rel="nofollow noopener noreferrer">Runa Capital</a> used the GitHub
API to
<a href="https://medium.com/runacapital/open-source-growth-benchmarks-and-the-20-fastest-growing-oss-startups-d3556a669fe6" target="_blank" rel="nofollow noopener noreferrer">identify the fastest growing public repositories on GitHub</a>
in terms of stars and forks. He used these metrics to estimate the top 20
fastest growing startups in open source software. And guess what, DVC made the
cut! We're in great company.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f0727e088aa19b3e291c39749796bcf5/39600/top20startups.png" alt="top20startups" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h3 id="new-team-member" style="position:relative;">New team member<a href="#new-team-member" aria-label="new team member permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We have a new teammate: <a href="https://www.linkedin.com/in/mvshmakov/" target="_blank" rel="nofollow noopener noreferrer">Maxim Shmakov</a>,
previously of Yandex, is a front-end engineer joining us from Moscow. Please
welcome him to DVC. 👋</p>
<h2 id="community-activity" style="position:relative;">Community activity<a href="#community-activity" aria-label="community activity permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We've been busy! Although we are mostly homebound these days, there has been no
shortage of speaking engagements. Here's a recap.</p>
<h3 id="meetings-and-talks" style="position:relative;">Meetings and talks<a href="#meetings-and-talks" aria-label="meetings and talks permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li>Co-founders Dmitry and Ivan appeared on the HasGeek TV series
<a href="https://hasgeek.com/fifthelephant/making-data-science-work-session-3/" target="_blank" rel="nofollow noopener noreferrer">Making Data Science Work</a>
to discuss engineering for data science with hosts
<a href="https://www.linkedin.com/in/pingali/" target="_blank" rel="nofollow noopener noreferrer">Venkata Pingali</a> and
<a href="https://www.linkedin.com/in/indrayudhghoshal/" target="_blank" rel="nofollow noopener noreferrer">Indrayudh Ghoshal</a>. The
livestream is available for viewing on YouTube!</li>
</ul>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/EWcpALbzZRg?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<ul>
<li>Dmitry appeared on the <a href="https://mlops.community/" target="_blank" rel="nofollow noopener noreferrer">MLOps.community</a> meetup to
chat with host <a href="https://www.linkedin.com/in/dpbrinkm/" target="_blank" rel="nofollow noopener noreferrer">Demetrios Brinkmann</a>.
They talked about the open source ecosystem, the difference between tools and
platforms, and what it means to codify data.</li>
</ul>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/ojV1tK9jXH8?rel=0&%3B=&%3Bshowinfo=0%3B&start=2295" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<ul>
<li>I (Elle) gave a talk at the
<a href="https://mlopsworld.com/" target="_blank" rel="nofollow noopener noreferrer">MLOps Production & Engineering World</a> meeting,
called "Adapting continuous integration and continuous delivery for ML". I
shared an approach to using GitHub Actions with ML projects. Video coming
soon!</li>
</ul>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Elle O'Brien is currently explaining the adaptation of continuous integration and continuous delivery for ML at <a href="https://twitter.com/hashtag/MLOPS2020?src=hash&ref_src=twsrc%5Etfw">#MLOPS2020</a>!<br><br>From explaining DVC to providing great examples - a very interesting talk with @andronovhopf taking place right now! <a href="https://t.co/dJjuLb0k4F">pic.twitter.com/dJjuLb0k4F</a></p>— Toronto Machine Learning Society (TMLS) (@TMLS_TO) <a href="https://twitter.com/TMLS_TO/status/1273693487104503808">June 18, 2020</a></blockquote>
<ul>
<li>Extremely early the next morning, clinician-scientist
<a href="https://www.linkedin.com/in/crislanting/?originalSubdomain=nl" target="_blank" rel="nofollow noopener noreferrer">Cris Lanting</a>
and I co-led a workshop about developing strong computational infrastructure
and practices in research as part of the
<a href="https://computationalaudiology.com/" target="_blank" rel="nofollow noopener noreferrer">Virtual Conference on Computational Audiology</a>.
We talked about big ideas for making scientific research reproducible,
manageable, and shareable. For the curious, the workshop is still viewable!</li>
</ul>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/W4CoptalWw0?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<ul>
<li>DVC has a virtual poster at <a href="https://www.scipy2020.scipy.org/" target="_blank" rel="nofollow noopener noreferrer">SciPy 2020</a>! We
prepared a demo about
<a href="https://dvc.org/blog/scipy-2020-dvc-poster" target="_blank" rel="nofollow noopener noreferrer">packaging models and datasets like software</a>
so they can be widely disseminated via GitHub.</li>
</ul>
<h3 id="good-reads" style="position:relative;">Good reads<a href="#good-reads" aria-label="good reads permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Some excellent reading recommendations from the community:</p>
<ul>
<li>Data scientist Déborah Mesquita published a thorough guide to using new DVC
1.0 pipelines in a sample ML project. It's truly complete, covering data
collection to model evaluation, with detailed code examples. If you are new to
pipelines, do not miss this!</li>
</ul>
<section class="elp-content-holder">
<a href="https://towardsdatascience.com/the-ultimate-guide-to-building-maintainable-machine-learning-pipelines-using-dvc-a976907b2a1b" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">The ultimate guide to building maintainable Machine Learning pipelines using DVC</h4>
<div class="elp-description">Learn the principles for building maintainable Machine Learning pipelines using DVC</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-07-10/pipes-897b33e8338e2f7b20b08ba2a175c2d7.jpg" alt="The ultimate guide to building maintainable Machine Learning pipelines using DVC">
</div>
</a>
</section>
<ul>
<li>Caleb Kaiser of <a href="https://github.com/cortexlabs/cortex" target="_blank" rel="nofollow noopener noreferrer">Cortex</a> (another
startup in Runa Capital's Top 20 list!) shared a thinkpiece about
challenges from software engineering that can inform production ML. We really
agree with what he has to say about reproducibility:</li>
</ul>
<blockquote>
<p>You typically hear about “reproducibility” in reference to ML research,
particularly when a paper doesn’t include enough information to recreate the
experiment. However, reproducibility also comes up a lot in production ML.
Think of it this way — you’re on a team staffed with data scientists and
engineers, and you’re all responsible for an image classification API. The
data scientists are constantly trying new techniques and architectural tweaks
to improve the model’s baseline performance, while at the same time, the model
is constantly being retrained on new data. Looking over the APIs performance,
you see one moment a week ago where the model’s performance dropped
significantly. What caused that drop? Without knowing exactly how the model
was trained, and on what data, it’s impossible to know for sure.</p>
</blockquote>
<section class="elp-content-holder">
<a href="https://towardsdatascience.com/what-software-engineers-can-bring-to-machine-learning-25f458c80e5" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">What software engineers can bring to machine learning</h4>
<div class="elp-description">Many production machine learning challenges are paralleled in software engineering</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-07-10/tds-fec2edf6c4678dec52338b870f8ce9c6.jpg" alt="What software engineers can bring to machine learning">
</div>
</a>
</section>
<ul>
<li>Mukul Sood wrote about the Real World, a place beyond Jupyter notebooks where
data is non-stationary and servers are unreliable! He covers some very real
challenges for taking a data science project into production and introduces
the need for CI/CD practices in healthy, scalable ML applications.</li>
</ul>
<section class="elp-content-holder">
<a href="https://towardsdatascience.com/scaling-machine-learning-in-real-world-cb601b2baf4a" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Scaling Machine Learning in the Real World</h4>
<div class="elp-description">Any conversation around scaling or productionizing data science, would need to talk about Continuous Integration/Continuous Deployment.</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-07-10/storm-eda2b016d57c4d44498606e0ce461437.jpg" alt="Scaling Machine Learning in the Real World">
</div>
</a>
</section>
<h3 id="a-nice-tweet" style="position:relative;">A nice tweet<a href="#a-nice-tweet" aria-label="a nice tweet permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We'll close on a nice tweet from <a href="https://datasyndrome.com/" target="_blank" rel="nofollow noopener noreferrer">Russell Jurney</a>:</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">I have to say I am blown out of the water by <a href="https://twitter.com/DVCorg">@DVCorg</a><br><br>DVC is incredibly powerful. Right now we’re just versioning input/output datasets in DVC against S3, but even this is incredibly useful and so much better than trying Git LFS (ugh) or manual archiving.<a href="https://t.co/5bf5VJuPaE">https://t.co/5bf5VJuPaE</a></p>— Russell Jurney 🇺🇦 (@rjurney) <a href="https://twitter.com/rjurney/status/1266735603921547264">May 30, 2020</a></blockquote>
<p>Thanks, we couldn't do it without our community! As always, thanks for joining
us and reading. There are lots of ways to stay in touch and we always love to
hear from you. Follow us on <a href="https://twitter.com/dvcorg">Twitter</a>, join our
<a href="https://discordapp.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord server</a>, or leave a blog
comment. Until next time! 😎</p>
https://dvc.org/blog/cml-release
Tue, 07 Jul 2020 00:00:00 GMT
<h2 id="cicd-for-machine-learning" style="position:relative;">CI/CD for machine learning<a href="#cicd-for-machine-learning" aria-label="cicd for machine learning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Today, the DVC team is releasing a new open-source project called Continuous
Machine Learning, or CML (<a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">https://cml.dev</a>), to bring the best engineering
practices of CI/CD to AI and ML teams. CML helps teams organize MLOps
infrastructure on top of the traditional software engineering stack instead of
creating separate AI platforms.</p>
<p>Continuous integration and continuous delivery (CI/CD) is a widely used software
engineering practice. It's a validated approach to increasing the agility of
software development without sacrificing stability. <strong>But why haven't CI/CD
practices taken root in machine learning and data science so far?</strong></p>
<p>We see three substantial technical barriers to using standard CI systems with
machine learning projects:</p>
<ol>
<li><strong>Data dependencies.</strong> In ML, data plays a role similar to code: ML results
critically depend on datasets, and changes in data need to trigger feedback
just like changes in source code. Furthermore, multi-GB datasets are
challenging to manage with Git-centric CI systems.</li>
<li><strong>Metrics-driven.</strong> The traditional software engineering idea of pass/fail
tests does not apply in ML. As an example, <code>+0.72% accuracy</code> and
<code>-0.35% precision</code> do not answer the question of whether the ML model is
good. Detailed reports with metrics and plots are needed to decide whether a
model is good or bad.</li>
<li><strong>CPU/GPU resources.</strong> ML training often requires more resources than
CI/CD runners typically have. CI/CD must be connected with cloud
computing instances or Kubernetes clusters for ML training.</li>
</ol>
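<p>The metrics-driven point can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not part of CML: the function
names (<code>metric_deltas</code>, <code>render_report</code>) and the metric values are made up, and
the output is a plain markdown table that a reviewer would read instead of a
pass/fail flag.</p>

```python
def metric_deltas(baseline, candidate):
    """Return {metric: (baseline, candidate, change)} for metrics present in both runs."""
    shared = [name for name in sorted(baseline) if name in candidate]
    return {
        name: (baseline[name], candidate[name], candidate[name] - baseline[name])
        for name in shared
    }


def render_report(deltas):
    """Render the comparison as a markdown table with signed percentage changes."""
    lines = [
        "| metric | baseline | candidate | change |",
        "| --- | --- | --- | --- |",
    ]
    for name, (old, new, diff) in deltas.items():
        lines.append(f"| {name} | {old:.2%} | {new:.2%} | {diff:+.2%} |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical metric values chosen to mirror the example in the text.
    main_branch = {"accuracy": 0.9000, "precision": 0.8500}
    experiment = {"accuracy": 0.9072, "precision": 0.8465}
    print(render_report(metric_deltas(main_branch, experiment)))
```

<p>Neither row says "pass" or "fail" on its own; the table only gives teammates
the numbers they need to discuss the trade-off.</p>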
<h2 id="cicd-for-ml-is-the-next-step-for-the-dvc-team" style="position:relative;">CI/CD for ML is the next step for the DVC team<a href="#cicd-for-ml-is-the-next-step-for-the-dvc-team" aria-label="cicd for ml is the next step for the dvc team permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Since the beginning, our motivation has been helping ML teams benefit from
DevOps. We started DVC because we knew that data management would be a crucial
bottleneck, and sure enough, DVC was a big step towards making pipelines and
experiments manageable and reproducible. But conversations with our community
have brought us to one conclusion again and again: CI/CD for ML is the holy
grail.</p>
<p>Over the last 3 years, we've reached some big milestones:</p>
<ol>
<li>
<p>We built DVC to address the ML data management problem. Recently, we
<a href="https://dvc.org/blog/dvc-1-0-release" target="_blank" rel="nofollow noopener noreferrer">released DVC 1.0</a>, marking a new and
more stable era for our API.</p>
</li>
<li>
<p>DVC has become a core part of many ML teams' daily operations. The latest
<a href="https://www.thoughtworks.com/radar/tools" target="_blank" rel="nofollow noopener noreferrer">ThoughtWorks Technology Radar</a>
says:</p>
<p><em>"… it [DVC] has become a favorite tool for managing experiments in machine
learning (ML) projects. Since it's based on Git, DVC is a familiar
environment for software developers to bring their engineering practices to
ML practice."</em></p>
</li>
<li>
<p>An extraordinary team and community have emerged around DVC:</p>
<ul>
<li>15 employees in our organization <a href="https://iterative.ai" target="_blank" rel="nofollow noopener noreferrer">https://iterative.ai</a></li>
<li>100+ open-source contributors to DVC <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/dvc</a> and
another 100+ open-source contributors to docs
<a href="https://github.com/iterative/dvc.org" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/dvc.org</a></li>
<li>2000+ community members in our Discord <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">https://dvc.org/chat</a> and GitHub
issue tracker <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/dvc</a></li>
<li>4000+ regular users of DVC</li>
</ul>
</li>
</ol>
<p>Now that DVC is maturing, we're ready to take the next step: we want to
revolutionize the ML development processes. We want ML experiments to have
greater visibility to teammates, shorter feedback loops, and more
reproducibility. We want teams to spend less time managing their computing
resources and experiments, and more time building value. The goal is to extend
the amazing results of DevOps from software development to ML and MLOps.</p>
<h2 id="continuous-machine-learning-release" style="position:relative;"><em>Continuous Machine Learning</em> release<a href="#continuous-machine-learning-release" aria-label="continuous machine learning release permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Today, we're releasing an open-source project <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">https://cml.dev</a> to close the gap
between machine learning and software development practices.</p>
<p>CML is a library of functions used inside CI/CD runners to make ML compatible
with <strong>GitHub Actions</strong> and <strong>GitLab CI</strong>. We've created functions to:</p>
<ol>
<li>Generate informative reports on every Pull/Merge Request with metrics, plots,
and hyperparameters changes.</li>
<li>Provision GPU/CPU resources from cloud service providers (<strong>AWS, GCP, Azure,
Ali</strong>) and deploy CI runners using
<a href="https://github.com/docker/machine" target="_blank" rel="nofollow noopener noreferrer">Docker Machine</a>.</li>
<li>Bring datasets from cloud storage to runners (using <strong>DVC</strong>) for model
training, as well as save the resulting model in cloud storage.</li>
</ol>
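<p>To make the first item more concrete: a metrics report of this kind boils down to
a Markdown table of metric changes between two versions of a model. The sketch
below is only an illustration in plain Python, not CML's implementation. The
function name and the metric values are hypothetical.</p>

```python
def metrics_diff_markdown(baseline, candidate):
    """Render a Markdown table comparing two {metric: value} dicts."""
    lines = [
        "| metric | baseline | candidate | change |",
        "|---|---|---|---|",
    ]
    # Only compare metrics that exist in both runs.
    for name in sorted(baseline.keys() & candidate.keys()):
        old, new = baseline[name], candidate[name]
        lines.append(f"| {name} | {old:.4f} | {new:.4f} | {new - old:+.4f} |")
    return "\n".join(lines)


# Hypothetical metric values, for illustration only.
report = metrics_diff_markdown(
    {"accuracy": 0.8500, "precision": 0.9100},
    {"accuracy": 0.8572, "precision": 0.9065},
)
print(report)
```

<p>A table like this can be written to a file and posted as a comment, which is
the role <code>report.md</code> plays in a CI workflow.</p>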
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4abfb3a481ef05b3f8a0140ede1bda90/39600/cml-report-metrics.png" alt="Auto-generated metrics-driven report in GitLab Merge Request" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>The workflow and visual reports are customizable by modifying the CI
configuration file in your GitHub <code>.github/workflows/*.yaml</code> or GitLab
<code>.gitlab-ci.yml</code> project. Use CML functions in conjunction with your own ML
model training and testing scripts to create your own automated workflow and
reporting system.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># GitLab workflow in '.gitlab-ci.yml' file</span>
<span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> cml_run
<span class="token key atrule">cml</span><span class="token punctuation">:</span>
  <span class="token key atrule">stage</span><span class="token punctuation">:</span> cml_run
  <span class="token key atrule">image</span><span class="token punctuation">:</span> iterativeai/cml<span class="token punctuation">:</span>0<span class="token punctuation">-</span>dvc2<span class="token punctuation">-</span>base1
  <span class="token key atrule">script</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> dvc pull data <span class="token punctuation">-</span><span class="token punctuation">-</span>run<span class="token punctuation">-</span>cache
    <span class="token punctuation">-</span> pip install <span class="token punctuation">-</span>r requirements.txt
    <span class="token punctuation">-</span> dvc repro
    <span class="token comment"># Compare metrics to master</span>
    <span class="token punctuation">-</span> git fetch <span class="token punctuation">-</span><span class="token punctuation">-</span>prune
    <span class="token punctuation">-</span> dvc metrics diff <span class="token punctuation">-</span><span class="token punctuation">-</span>show<span class="token punctuation">-</span>md master <span class="token punctuation">></span><span class="token punctuation">></span> report.md
    <span class="token comment"># Visualize loss function diff</span>
    <span class="token punctuation">-</span> dvc plots diff <span class="token punctuation">-</span><span class="token punctuation">-</span>target loss.csv <span class="token punctuation">-</span><span class="token punctuation">-</span>show<span class="token punctuation">-</span>vega master <span class="token punctuation">></span> vega.json
    <span class="token punctuation">-</span> vl2png vega.json <span class="token punctuation">></span> plot.png
    <span class="token punctuation">-</span> cml publish <span class="token punctuation">-</span><span class="token punctuation">-</span>md plot.png <span class="token punctuation">></span><span class="token punctuation">></span> report.md
    <span class="token punctuation">-</span> dvc push data <span class="token punctuation">-</span><span class="token punctuation">-</span>run<span class="token punctuation">-</span>cache
    <span class="token punctuation">-</span> cml send<span class="token punctuation">-</span>comment report.md</code></pre></div>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 614px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/0fd9c743e967356a65ffee9780c681ec/39600/cml-report-params.png" alt="Hyperparameter change with a result image in GitHub Pull request report" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>In this example, all the CML functions are defined in the <strong>Docker image</strong> that
is used in the workflow: <code>iterativeai/cml:0-dvc2-base1</code>. Users can specify any
Docker image. The only restriction is that the CML library needs to be installed
to enable all the CML commands for reporting and graphs:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">npm</span> i @dvcorg/cml</code></pre></div>
<p>Examples of Docker images can be found in the <code>docker</code> directory of the
<a href="https://github.com/iterative/cml" target="_blank" rel="nofollow noopener noreferrer">CML repository</a>.</p>
<p>As you can see, CML is based on the assumption that MLOps can work with
traditional engineering tools. It shouldn't require an entirely separate
platform. We're excited about a world where DevOps practitioners can work
fluently on both software and ML aspects of a project.</p>
<h2 id="the-relationship-between-cml-and-dvc" style="position:relative;">The relationship between CML and DVC<a href="#the-relationship-between-cml-and-dvc" aria-label="the relationship between cml and dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>CML and DVC are related projects under the umbrella of the same team, but will
have separate websites and independent development. The CML project is hosted on
a new website: <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">https://cml.dev</a>. The source code and issue tracker are on GitHub:
<a href="https://github.com/iterative/cml" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/cml</a></p>
<p>For support and communications, the DVC Discord server is still the place to go:
<a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">https://dvc.org/chat</a>. We've made a new <code>#cml</code> channel there to discuss CML, CI/CD
for ML, and other MLOps-related questions.</p>
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>With the rise of AI/ML teams and ML platforms alongside the software
engineering stack, we believe that the industry needs a single technology stack
for both software and AI projects. A simple layer of tooling is
required to close the gap between AI projects and software projects and fit them
into the existing stack, and CML is our way to provide it.</p>
<p>Our philosophy is that ML projects, and MLOps practices, should be built on top
of traditional engineering tools and not as a separate stack. A simple layer of
tools will be required to close the gap, and CML is part of this ecosystem. We
think this is the future of MLOps.</p>
<p>As always, thanks for reading and for being part of the DVC community. We'd love
to hear what you think about CML. Please be in touch on
<a href="https://twitter.com/dvcorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> and <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</p>https://dvc.org/blog/june-20-community-gemshttps://dvc.org/blog/june-20-community-gemsMon, 29 Jun 2020 00:00:00 GMT<h2 id="highlights-from-discord" style="position:relative;">Highlights from Discord<a href="#highlights-from-discord" aria-label="highlights from discord permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Here are some Q&A's from our Discord channel that we think are worth sharing.</p>
<h3 id="q-i-just-upgraded-to-dvc-10-ive-got-some-pipeline-stages-currently-saved-as-dvc-files-is-there-an-easy-way-to-convert-the-old-dvc-format-to-the-new-dvcyaml-standard" style="position:relative;">Q: I just upgraded to DVC 1.0. I've got some pipeline stages currently saved as <code>.dvc</code> files. <a href="https://discord.com/channels/485586884165107732/563406153334128681/725019219930120232" target="_blank" rel="nofollow noopener noreferrer">Is there an easy way to convert the old <code>.dvc</code> format to the new <code>dvc.yaml</code> standard?</a><a href="#q-i-just-upgraded-to-dvc-10-ive-got-some-pipeline-stages-currently-saved-as-dvc-files-is-there-an-easy-way-to-convert-the-old-dvc-format-to-the-new-dvcyaml-standard" aria-label="q i just upgraded to dvc 10 ive got some pipeline stages currently saved as dvc files is there an easy way to convert the old dvc format to the new dvcyaml standard permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes! You can easily transfer the stages by hand: <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> is designed for
manual edits in any text editor, so you can type your old stages in and then
delete the old <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files. We also have a
<a href="https://gist.github.com/skshetry/07a3e26e6b06783e1ad7a4b6db6479da" target="_blank" rel="nofollow noopener noreferrer">migration script</a>
available, although we can't provide long-term support for it.</p>
<p>Learn more about the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> format in our
<a href="https://dvc.org/doc/user-guide/dvc-files#dvcyaml-file" target="_blank" rel="nofollow noopener noreferrer">brand new docs</a>!</p>
<p><img src="https://media.giphy.com/media/JYpTAnhT0EI2Q/giphy.gif" alt="Year Opening GIF"></p>
<p><em>Just like this but with technical documentation.</em></p>
<h3 id="q-after-i-pushed-my-local-data-to-remote-storage-i-noticed-the-file-names-are-different-in-my-storage-repository--theyre-hash-values-can-i-make-them-more-meaningful-names" style="position:relative;">Q: After I pushed my local data to remote storage, I noticed the file names are different in my storage repository- they're hash values. <a href="https://discord.com/channels/485586884165107732/563406153334128681/717737163122540585" target="_blank" rel="nofollow noopener noreferrer">Can I make them more meaningful names?</a><a href="#q-after-i-pushed-my-local-data-to-remote-storage-i-noticed-the-file-names-are-different-in-my-storage-repository--theyre-hash-values-can-i-make-them-more-meaningful-names" aria-label="q after i pushed my local data to remote storage i noticed the file names are different in my storage repository theyre hash values can i make them more meaningful names permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>No, but for a good reason! What you're seeing are cached files, and they're
stored with a special naming convention that makes DVC versioning and addressing
possible- these file names are how DVC deduplicates data (to avoid keeping
multiple copies of the same file version) and ensures that each unique version
of a file is immutable. If you manually overwrote those filenames you would risk
breaking Git version control. You can
<a href="https://dvc.org/doc/user-guide/dvc-internals#structure-of-cache-directory" target="_blank" rel="nofollow noopener noreferrer">read more about how DVC uses this file format in our docs</a>.</p>
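<p>To illustrate why hash-based names give you deduplication and immutability,
here is a minimal sketch of content addressing in Python. It assumes the
MD5-based layout described in the docs linked above, where the first two
characters of the hash form a directory name. The function name
<code>cache_path</code> and the sample data are hypothetical, and this is not
DVC's actual code.</p>

```python
import hashlib
from pathlib import PurePosixPath


def cache_path(content: bytes) -> PurePosixPath:
    """Derive a storage path from file content alone (assumed MD5 layout)."""
    digest = hashlib.md5(content).hexdigest()
    # First two hex characters become the directory, the rest the file name.
    return PurePosixPath(digest[:2]) / digest[2:]


# Identical content always maps to the same path, so it is stored once.
v1 = cache_path(b"label,value\ncat,1\n")
v1_copy = cache_path(b"label,value\ncat,1\n")

# Any change in content yields a different path, so old versions stay intact.
v2 = cache_path(b"label,value\ncat,2\n")

print(v1 == v1_copy, v1 == v2)
```

<p>Because the name is derived from the content, renaming a stored file by hand
would break the link between the name and what the file contains.</p>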
<p>It sounds like you're looking for ways to interact with DVC-tracked objects at a
high level of abstraction, meaning that you want to interface with the original
filenames and not the machine-generated hashes used by DVC. There are a few
secure and recommended ways to do this:</p>
<ul>
<li>If you want to see a human-readable list of files that are currently tracked
by DVC, try the <a href="https://dvc.org/doc/command-reference/list"><code>dvc list</code></a>
command-<a href="https://dvc.org/doc/command-reference/list" target="_blank" rel="nofollow noopener noreferrer">read up on it here</a>.</li>
<li>Check out our
<a href="https://dvc.org/doc/use-cases/data-registries#data-registries" target="_blank" rel="nofollow noopener noreferrer">data registry tutorial</a>
to see how the commands <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a> and <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> are used to download and
share DVC-tracked artifacts. The syntax is built for an experience like using
a package manager.</li>
<li>The <a href="https://dvc.org/doc/api-reference" target="_blank" rel="nofollow noopener noreferrer">DVC Python API</a> gives you programmatic
access to DVC-tracked artifacts, using human-readable filenames.</li>
</ul>
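<p>As a sketch of the last option, the helpers below wrap two functions from the
DVC Python API, <code>dvc.api.open</code> and <code>dvc.api.get_url</code>. The
helper names are our own, and DVC must be installed for the calls to work.</p>

```python
def read_tracked_text(path, repo, rev=None):
    """Read a DVC-tracked text file by its human-readable path."""
    import dvc.api  # imported lazily; requires `pip install dvc`

    with dvc.api.open(path, repo=repo, rev=rev) as handle:
        return handle.read()


def storage_url(path, repo, rev=None):
    """Return the remote storage URL behind a human-readable path."""
    import dvc.api  # imported lazily; requires `pip install dvc`

    return dvc.api.get_url(path, repo=repo, rev=rev)
```

<p>For example, assuming network access and the public
<code>iterative/dataset-registry</code> repository,
<code>read_tracked_text("get-started/data.xml", repo="https://github.com/iterative/dataset-registry")</code>
would return the file contents without you ever handling a hash.</p>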
<h3 id="q-is-it-better-practice-to-dvc-add-data-files-individually-or-to-add-a-directory-containing-multiple-data-files" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/563406153334128681/722141190312689675" target="_blank" rel="nofollow noopener noreferrer">Is it better practice to <code>dvc add</code> data files individually, or to add a directory containing multiple data files?</a><a href="#q-is-it-better-practice-to-dvc-add-data-files-individually-or-to-add-a-directory-containing-multiple-data-files" aria-label="q is it better practice to dvc add data files individually or to add a directory containing multiple data files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If the directory you're adding is logically one unit (for example, it is the
whole dataset in your project), we recommend using <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> at the directory
level. Otherwise, add files one-by-one. You can
<a href="https://dvc.org/doc/user-guide/dvc-internals#structure-of-cache-directory" target="_blank" rel="nofollow noopener noreferrer">read more about how DVC versions directories in our docs</a>.</p>
<h3 id="q-do-you-have-any-examples-of-using-dvc-with-minio" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/563406153334128681/722780202844815362" target="_blank" rel="nofollow noopener noreferrer">Do you have any examples of using DVC with MinIO?</a><a href="#q-do-you-have-any-examples-of-using-dvc-with-minio" aria-label="q do you have any examples of using dvc with minio permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We don't have any tutorials for this use case exactly, but it's a very
straightforward modification from
<a href="https://dvc.org/doc/use-cases" target="_blank" rel="nofollow noopener noreferrer">our basic use cases</a>. The key difference when
using MinIO or a similar S3-compatible storage (like DigitalOcean Spaces or IBM
Cloud Object Storage) is that in addition to setting remote data storage, you
must set the <code>endpointurl</code> too. For example:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">-d</span> myremote s3://mybucket/path/to/dir
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> myremote endpointurl https://object-storage.example.com</span></code></pre></div>
<p>Read up on configuring supported storage
<a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">in our docs</a>.</p>
<h3 id="q-if-i-have-a-folder-containing-many-data-files-is-there-any-advantage-to-zipping-the-folder-and-dvc-tracking-the-zip" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/563406153334128681/714922184455225445" target="_blank" rel="nofollow noopener noreferrer">If I have a folder containing many data files, is there any advantage to zipping the folder and DVC tracking the <code>.zip</code>?</a><a href="#q-if-i-have-a-folder-containing-many-data-files-is-there-any-advantage-to-zipping-the-folder-and-dvc-tracking-the-zip" aria-label="q if i have a folder containing many data files is there any advantage to zipping the folder and dvc tracking the zip permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There are a few things to consider:</p>
<ul>
<li>
<p><strong>CPU time.</strong> Even though it can be faster to pull a single file than a
directory (though not in all cases, since we can parallelize directory
downloads), the tradeoff is the time needed to unzip your data. Depending on
your constraints, this can be expensive and undesirable.</p>
</li>
<li>
<p><strong>Deduplication.</strong> DVC deduplicates on the file level. So if you add one
single file to a directory, DVC will save only that file, not the whole
dataset again. If you use a zipped directory you won't get this benefit. In
the long run, this could be more expensive in terms of storage space for your
DVC cache and remote if the contents of your dataset change frequently.</p>
</li>
</ul>
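<p>The deduplication point can be sketched in a few lines of Python. This is a
simplified model, not DVC's implementation: it hashes each file separately for
the directory case, hashes one archive for the zip case, and ignores the small
directory listing that DVC also stores. The file names and contents are made
up.</p>

```python
import hashlib
import io
import zipfile


def file_objects(files):
    """One stored object per unique file content (file-level deduplication)."""
    return {hashlib.md5(data).hexdigest() for data in files.values()}


def zip_object(files):
    """A zipped directory is a single object covering all files at once."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in sorted(files):
            # Fixed timestamp so the archive bytes depend only on the contents.
            info = zipfile.ZipInfo(name, date_time=(1980, 1, 1, 0, 0, 0))
            archive.writestr(info, files[name])
    return {hashlib.md5(buffer.getvalue()).hexdigest()}


before = {"a.csv": b"1,2,3\n", "b.csv": b"4,5,6\n"}
after = {**before, "c.csv": b"7,8,9\n"}

# Directory: only the newly added file is a new object.
new_as_directory = file_objects(after) - file_objects(before)

# Zip: the archive hash changes, so the whole archive is stored again.
new_as_zip = zip_object(after) - zip_object(before)

print(len(new_as_directory), "new file object;", len(new_as_zip), "new archive")
```

<p>In both cases one new object appears, but in the zip case that object contains
every file in the dataset again.</p>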
<p>Generally, we would recommend first trying a plain unzipped directory. DVC is
designed to work with large numbers of files (on the order of millions), and
the latest release (DVC 1.0) has
<a href="https://dvc.org/blog/dvc-1-0-release#data-transfer-optimizations" target="_blank" rel="nofollow noopener noreferrer">optimizations built for this purpose exactly</a>.</p>
<h3 id="q-can-i-execute-a-dvc-push-with-the-dvc-python-api-inside-a-python-script" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/718419219288686664" target="_blank" rel="nofollow noopener noreferrer">Q: Can I execute a <code>dvc push</code> with the DVC Python API inside a Python script?</a><a href="#q-can-i-execute-a-dvc-push-with-the-dvc-python-api-inside-a-python-script" aria-label="q can i execute a dvc push with the dvc python api inside a python script permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Currently, our <a href="https://dvc.org/doc/api-reference#python-api" target="_blank" rel="nofollow noopener noreferrer">Python API</a>
doesn't support commands like <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>, <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a>, or <a href="https://dvc.org/doc/command-reference/status"><code>dvc status</code></a>. It is
designed for interfacing with objects tracked by DVC. That said, CLI commands
are basically calling <code>dvc.repo.Repo</code> object methods. So if you want to use
commands from within Python code, you could try creating a <code>Repo</code> object with
<code>r = Repo({root_dir})</code> and then <code>r.push()</code>. Please note that we don't officially
support this use case yet.</p>
<p>Of course, you can also run DVC commands from a Python script using <code>subprocess</code> or a
similar library for issuing system commands.</p>
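<p>Both approaches can be sketched as follows. The first helper shells out to the
DVC CLI with Python's standard <code>subprocess</code> module. The second calls
<code>Repo.push()</code> directly, which, as noted above, is not an officially
supported use case. The helper names are our own, and DVC must be installed and
a remote configured for either call to succeed.</p>

```python
import subprocess


def dvc_command(*args):
    """Build the argument list for a DVC CLI call, e.g. dvc_command("push")."""
    return ["dvc", *args]


def run_dvc(*args, cwd=None):
    """Run a DVC CLI command in `cwd`; raises CalledProcessError on failure."""
    return subprocess.run(
        dvc_command(*args),
        cwd=cwd,
        check=True,
        capture_output=True,
        text=True,
    )


def push_with_repo(root_dir="."):
    """Unofficial alternative: call the method behind `dvc push` directly."""
    from dvc.repo import Repo  # imported lazily; requires DVC installed

    return Repo(root_dir).push()
```

<p>Inside a DVC project, <code>run_dvc("push")</code> behaves like typing
<code>dvc push</code> in a terminal, and the captured output is available on the
returned object's <code>stdout</code> attribute.</p>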
<h3 id="q-does-the-dvc-pipeline-command-for-visualizing-pipelines-still-work-in-dvc-10" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/717682556203565127" target="_blank" rel="nofollow noopener noreferrer">Q: Does the <code>dvc pipeline</code> command for visualizing pipelines still work in DVC 1.0?</a><a href="#q-does-the-dvc-pipeline-command-for-visualizing-pipelines-still-work-in-dvc-10" aria-label="q does the dvc pipeline command for visualizing pipelines still work in dvc 10 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Most of the <code>dvc pipeline</code> functionality- like <code>dvc pipeline show --ascii</code> to
print out an ASCII diagram of your pipeline- has been migrated to a new command,
<a href="https://dvc.org/doc/command-reference/dag"><code>dvc dag</code></a>. This function is written for our new pipeline format. Check out
<a href="https://dvc.org/doc/command-reference/dag#dag" target="_blank" rel="nofollow noopener noreferrer">our new docs</a> for an example.</p>
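<p>If the term is new to you, a DAG (directed acyclic graph) describes which
stages depend on which. The sketch below uses a made-up three-stage pipeline and
Python's standard <code>graphlib</code> module (Python 3.9+) to show the order
such a graph implies. It only illustrates the concept and says nothing about how
<code>dvc dag</code> is implemented.</p>

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage maps to the stages it depends on.
pipeline = {
    "prepare": set(),
    "train": {"prepare"},
    "evaluate": {"train"},
}

# A valid execution order runs every stage after its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(" -> ".join(order))
```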
<h3 id="q-is-there-a-way-to-create-a-dvc-pipeline-stage-without-running-the-commands-in-that-stage" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/715271980978405447" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to create a DVC pipeline stage without running the commands in that stage?</a><a href="#q-is-there-a-way-to-create-a-dvc-pipeline-stage-without-running-the-commands-in-that-stage" aria-label="q is there a way to create a dvc pipeline stage without running the commands in that stage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes. Say you have a Python script, <code>train.py</code>, that takes in a dataset <code>data</code>
and outputs a model <code>model.pkl</code>. To create a DVC pipeline stage corresponding to
this process, you could do so like this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-n</span> train \
</span>          -d train.py -d data \
          -o model.pkl \
          python train.py</code></pre></div>
<p>However, this would automatically rerun the command <code>python train.py</code>, which is
not necessarily desirable if you have recently run it, the process is time
consuming, and the dependencies and outputs haven't changed. You can use the
<code>--no-exec</code> flag to get around this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">--no-exec</span> \
</span>          -n train \
          -d train.py -d data \
          -o model.pkl \
          python train.py</code></pre></div>
<p>This flag can also be useful when you want to define the pipeline on your local
machine but plan to run it later on a different machine (perhaps an instance in
the cloud).
<a href="https://dvc.org/doc/command-reference/run" target="_blank" rel="nofollow noopener noreferrer">Read more about the <code>--no-exec</code> flag in our docs.</a></p>
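<p>For orientation, a stage like <code>train</code> ends up in <code>dvc.yaml</code> as a
small mapping of the command, its dependencies, and its outputs. The sketch
below builds that structure as a Python dictionary and prints it as JSON for
readability. The real file is YAML, and the helper name
<code>stage_entry</code> is hypothetical.</p>

```python
import json


def stage_entry(name, cmd, deps, outs):
    """Mirror the shape of a single stage under the `stages` key."""
    return {
        "stages": {
            name: {
                "cmd": cmd,
                "deps": list(deps),
                "outs": list(outs),
            }
        }
    }


entry = stage_entry(
    "train",
    "python train.py",
    deps=["train.py", "data"],
    outs=["model.pkl"],
)
print(json.dumps(entry, indent=2))
```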
<p>One other approach worth mentioning is that you can manually edit your
<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file to add a stage. If you add a stage this way, pipeline commands
won't be executed until you run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>.</p>https://dvc.org/blog/scipy-2020-dvc-posterhttps://dvc.org/blog/scipy-2020-dvc-posterFri, 26 Jun 2020 00:00:00 GMT<p>When I was doing my Ph.D., every time I published a paper I shared a public
GitHub repository with my dataset and scripts to reproduce my statistical
analyses. While it took a bit of work to get the repository in good shape for
sharing (cleaning up code, adding documentation), the process was
straightforward: upload everything to the repo!</p>
<p>But when I started working on deep learning projects, things got considerably
more complicated. For example, in a
<a href="https://pudding.cool/2019/11/big-hair/" target="_blank" rel="nofollow noopener noreferrer">data journalism project I did with The Pudding</a>,
I wanted to understand how hair style (particularly size!) changed over the
years. There were a lot of moving parts:</p>
<ul>
<li>A public dataset of yearbook photos released and maintained by
<a href="https://people.eecs.berkeley.edu/~shiry/projects/yearbooks/yearbooks.html" target="_blank" rel="nofollow noopener noreferrer">Ginosar et al.</a></li>
<li>A deep learning model I trained to segment the hair in yearbook photos</li>
<li>A derivative dataset of "hair maps" for each photo in the original dataset</li>
<li>All the code to train the deep learning model and analyse the derivative
dataset</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b058a7be71c126ec336b730fa3dc7718/39600/hairflow.png" alt="hairflow" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>The parts of my big-hair-data
project: an original public dataset, a model for segmenting the images, a
derivative dataset of segment maps, and analysis scripts.</em></p>
<p>How would you share this with a collaborator, or open it up to the public?
Throwing it all in a GitHub repository was not an option. My model wouldn't fit
on GitHub because it was over the 100 MB size limit. I also wanted to preserve a
clear link between my derived dataset and the original- it should be obvious
exactly how I got the public dataset. And if that public dataset were to ever
change, I would ideally want it to be clear what version I used for my analyses.</p>
<p>This blog is about several different ways of "releasing" data science projects,
with an emphasis on preserving meaningful links about the origins of derived
data and models. I'm not making any strong assumptions about whether project
materials are released within an organization (only to teammates, for example) or
to the whole internet.</p>
<p>Let's look at a few methods.</p>
<h1 id="method-one-artifacts-in-the-cloud" style="position:relative;">Method One: artifacts in the cloud<a href="#method-one-artifacts-in-the-cloud" aria-label="method one artifacts in the cloud permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>When you work with big models and datasets, you often can't host them in a
GitHub repo. But you can put them in cloud storage, and then provide a script in
your GitHub repo to download them. For example, in the fantastic <code>gpt-2-simple</code>
<a href="https://github.com/minimaxir/gpt-2-simple" target="_blank" rel="nofollow noopener noreferrer">project by Max Woolf</a>, Max stores
huge GPT-2 models in Google Drive and provides a script to download a specified
model to a user's local workspace if it isn't already there.</p>
<p>Likewise, the <a href="https://github.com/NVlabs/stylegan" target="_blank" rel="nofollow noopener noreferrer">Nvidia StyleGAN release</a>
provides a hardcoded URL to their model in Google Drive storage. Both the
<code>gpt-2-simple</code> and StyleGAN projects have custom scripts to handle these big
downloads, and largely thanks to the work of the project maintainers, users only
interact with the downloading process at a very high level.</p>
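<p>As a rough sketch of what such a download helper does (this is not the actual
code from <code>gpt-2-simple</code> or StyleGAN; the function name
<code>download_if_missing</code> is hypothetical), the core logic is to skip the
download when the file is already in the local workspace:</p>

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: Path) -> bool:
    """Fetch `url` into `dest` unless `dest` already exists.

    Returns True if a download happened, False if the local copy was reused.
    """
    dest = Path(dest)
    if dest.exists():
        return False  # already in the local workspace, nothing to do
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first so an interrupted download
    # never leaves a half-written model at the final path.
    with urllib.request.urlopen(url) as response:
        with tempfile.NamedTemporaryFile(dir=dest.parent, delete=False) as tmp:
            shutil.copyfileobj(response, tmp)
    Path(tmp.name).replace(dest)
    return True
```

<p>A caller would pass a hardcoded link and a local path, for example a
placeholder such as <code>https://example.com/models/model.pkl</code>. Note that
nothing here records <em>which version</em> of the file was fetched, which is
one of the drawbacks of this approach.</p>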
<p>Considering some pros and cons of this approach:</p>
<table><thead><tr><th align="center"><strong>Pros</strong></th><th align="center"><strong>Cons</strong></th></tr></thead><tbody><tr><td align="center">It's easy to put a model in a bucket</td><td align="center">Hardcoded links are brittle</td></tr><tr><td align="center">Works for pip packages</td><td align="center">Need to write custom functions</td></tr><tr><td align="center">No extra tools, just Python scripting</td><td align="center">Downloads aren't versioned</td></tr></tbody></table>
<h1 id="method-two-hubs-catalogs--zoos" style="position:relative;">Method Two: Hubs, Catalogs & Zoos<a href="#method-two-hubs-catalogs--zoos" aria-label="method two hubs catalogs zoos permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>There are a (growing) number of websites willing to long-term host big models
and datasets, plus relevant meta-data, code, and publications. Some even allow
you to upload several versions of a project- it's not Git, for sure, but even
basic version control is something.</p>
<p>For example, <a href="https://pytorch.org/hub/" target="_blank" rel="nofollow noopener noreferrer">PyTorch Hub</a> lets researchers publish
trained models developed in the PyTorch framework, along with code and papers.
It's easily searched and browsed, which makes projects discoverable.</p>
<p>For a dataset analog, Kaggle is similar- they host user-submitted datasets and
help other users find them. Both PyTorch Hub and Kaggle have APIs for
programmatically downloading artifacts.</p>
<table><thead><tr><th align="center"><strong>Pros</strong></th><th align="center"><strong>Cons</strong></th></tr></thead><tbody><tr><td align="center">Browsable & discoverable</td><td align="center">Centrally managed</td></tr><tr><td align="center">Public</td><td align="center">Public (no granularity)</td></tr><tr><td align="center">Good with big models</td><td align="center">Weak versioning support</td></tr></tbody></table>
<h1 id="method-three-packaging-with-dvc" style="position:relative;">Method Three: Packaging with DVC<a href="#method-three-packaging-with-dvc" aria-label="method three packaging with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p><a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, or Data Version Control, is a Python project for
extending Git version control to large project artifacts like datasets and
models. It's not a replacement for Git- DVC works <em>with</em> Git!</p>
<p>The basic idea is that your datasets and models are stored in a DVC remote,
which can be any cloud storage or server of your choice. DVC creates metadata
about file versions that can be tracked by Git and hosted on GitHub- so you can
share your datasets and models like any GitHub project, with all the benefits of
versioning. Let's look at a case study.</p>
<h2 id="creating-a-dvc-project" style="position:relative;">Creating a DVC project<a href="#creating-a-dvc-project" aria-label="creating a dvc project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Say I have a project containing a dataset, model training code, and model.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">ls</span>
</span>data.csv
train.py
model.pkl</code></pre></div>
<p>Say our model and dataset are large and we want to track them with DVC. For
remote storage, we want to use a personal S3 bucket. We would run:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git init</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc init</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">-d</span> myremote s3://mybucket/myproject
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc add</span> data.csv model.pkl
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span></span></code></pre></div>
<p>When I run these commands, I've initialized Git and DVC tracking. Next, I've set
my S3 bucket as the DVC remote. Then I've added <code>data.csv</code> and <code>model.pkl</code> to
DVC tracking. Finally, when I run <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>, the model and dataset are pushed
to the S3 bucket. On my local machine, two meta-files are created:
<code>data.csv.dvc</code> and <code>model.pkl.dvc</code>. These can be tracked with Git!</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">ls</span>
</span>data.csv.dvc
train.py
model.pkl.dvc</code></pre></div>
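<p>To give a feel for why these meta-files are small enough for Git, here is a
simplified sketch of the idea: hash the large file and write a short text
pointer next to it. This is an illustration only, not DVC's implementation; the
helper names <code>file_md5</code> and <code>write_pointer</code> are
hypothetical, and real <code>.dvc</code> files may carry additional fields.</p>

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts never load fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(path: Path) -> Path:
    """Write a small text pointer next to `path`, loosely modeled on a .dvc file."""
    path = Path(path)
    pointer = path.with_name(path.name + ".dvc")
    pointer.write_text(
        "outs:\n"
        f"- md5: {file_md5(path)}\n"
        f"  path: {path.name}\n"
    )
    return pointer
```

<p>The pointer is a few lines of text no matter how large the artifact is, so
Git versions the pointer while the artifact itself lives in remote storage.</p>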
<p>So after setting a remote Git repository, <code>git add</code>, <code>commit</code> and <code>push</code> like
usual (assuming you are a regular Git user, that is):</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git remote add</span> origin git@github.com:elle/myproject
</span><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> <span class="token builtin class-name">.</span> <span class="token operator">&&</span> <span class="token function">git</span> commit <span class="token parameter variable">-m</span> <span class="token string">"first commit"</span>
</span><span class="token line"><span class="token input">$ </span><span class="token git">git push</span> origin master</span></code></pre></div>
<h2 id="package-management-with-dvc" style="position:relative;">Package management with DVC<a href="#package-management-with-dvc" aria-label="package management with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Now let's say one of my teammates wants to access my work so far- specifically,
they want to see if another method for constructing features from raw data will
help model accuracy. I've given them permission to access my GitHub repository.
On their local machine, they'll run:</p>
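<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc import</span> https://github.com/elle/myproject data.csv
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc import</span> https://github.com/elle/myproject model.pkl</span></code></pre></div>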
<p>This will download the latest version of the <code>data.csv</code> and <code>model.pkl</code>
artifacts to their local machine, as well as the DVC metafiles <code>data.csv.dvc</code>
and <code>model.pkl.dvc</code> indicating the precise version and source.</p>
<p>Collaborators can also download artifacts from previous versions, releases, or
parallel feature branches of a project. For example, if I released a new version
of my project with a Git tag (say <code>v.2.0.1</code>), collaborators can run</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> <span class="token parameter variable">--rev</span> v.2.0.1 <span class="token punctuation">\</span>
https://github.com/elle/myproject data.csv</span></code></pre></div>
<p>Lastly, because <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> maintains a link between the downloaded artifacts
and my repository, collaborators can check for project updates with</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc update</span> data.csv.dvc model.pkl.dvc</span></code></pre></div>
<p>If new versions are detected, DVC automatically syncs the local workspace with
those versions.</p>
<h2 id="when-should-you-do-this" style="position:relative;">When should you do this?<a href="#when-should-you-do-this" aria-label="when should you do this permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In my own experience releasing a large public dataset with DVC, I've seen
several benefits:</p>
<ul>
<li>Within an hour, someone found data points I'd been missing. It was
straightforward to make a new release after patching this error.</li>
<li>Several people modeled my dataset! Highly rewarding.</li>
<li>Since GitHub is a widely used platform for code sharing, it's a natural fit
for open source scientific projects and has little overhead for potential
collaborators</li>
</ul>
<p>To return to the pros and cons table:</p>
<table><thead><tr><th align="center"><strong>Pros</strong></th><th align="center"><strong>Cons</strong></th></tr></thead><tbody><tr><td align="center">Git version your dataset</td><td align="center">No GUI access to files in DVC remote</td></tr><tr><td align="center">Granular sharing permissions</td><td align="center">Collaborators need to use DVC</td></tr><tr><td align="center">DVC abstracts away download scripts/hardcoded URLs</td><td align="center">Can be serverless, but you need to manage cloud storage</td></tr></tbody></table>
<h1 id="the-bottom-line" style="position:relative;">The bottom line<a href="#the-bottom-line" aria-label="the bottom line permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h1>
<p>Packaging models and datasets is a non-trivial part of the machine learning
workflow. DVC provides a method for giving users a Git-centric experience of
cloning or forking these artifacts, with an emphasis on <em>versioning artifacts</em>
and <em>abstracting away the processes of uploading, downloading, and storing
artifacts</em>. For projects with high complexity- like my hair project, which had
some gnarly dependencies and big artifacts- this kind of source control pays
off. If you don't know where your data came from or how it's been transformed,
it's impossible to be scientific.</p>
<p>Thanks for stopping by our virtual poster! I'm happy to take questions or
comments about how version control fits into the scientific workflow. Leave a
comment, reach out on Twitter, or send an email.</p>
<h2 id="further-reading" style="position:relative;">Further reading<a href="#further-reading" aria-label="further reading permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><em>Check out our
<a href="https://dvc.org/doc/use-cases/data-registries" target="_blank" rel="nofollow noopener noreferrer">tutorial about creating a data registry</a>
for more code examples.</em></p>https://dvc.org/blog/dvc-1-0-releasehttps://dvc.org/blog/dvc-1-0-releaseMon, 22 Jun 2020 00:00:00 GMT<h2 id="introduction" style="position:relative;">Introduction<a href="#introduction" aria-label="introduction permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>3 years ago, I was concerned about good engineering standards in data science:
data versioning, reproducibility, workflow automation — like continuous
integration and continuous delivery (CI/CD), but for machine learning. I wanted
there to be a "Git for data" to make all this possible. So I created DVC (Data
Version Control), which works as version control for data projects.</p>
<p>Technically, DVC codifies your data and machine learning pipelines as text
metafiles (with pointers to actual data in S3/GCP/Azure/SSH), while you use Git
for the actual versioning. DevOps folks call this approach GitOps or, more
specifically, in this case <em>DataOps</em> or <em>MLOps</em>.</p>
<p>The new DVC 1.0 is inspired by discussions and contributions from our community
of data scientists, ML engineers, developers and software engineers.</p>
<h2 id="dvc-10" style="position:relative;">DVC 1.0<a href="#dvc-10" aria-label="dvc 10 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The new DVC 1.0 is inspired by discussions and contributions from our community
— both fresh ideas and bug reports 😅. All these contributions, big and small,
have a collective impact on DVC's development. I'm confident 1.0 wouldn't be
possible without our community. They tell us what features matter most, which
approaches work (and which don't!), and what they need from DVC to support their
ML projects.</p>
<p>A few weeks ago we announced the 1.0 pre-release. After lots of helpful feedback
from brave users, it's time to go live. Now, DVC 1.0 is available with all the
standard installation methods including <code>pip</code>, <code>conda</code>, <code>brew</code>, <code>choco</code>, and
system-specific packages: deb, rpm, msi, pkg. See <a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">https://dvc.org/doc/install</a>
for more details.</p>
<h2 id="new-features" style="position:relative;">New features<a href="#new-features" aria-label="new features permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>It took us 3 years to finalize the requirements for DVC 1.0 and stabilize the
commands (API) and DVC file formats. Below are the major lessons that we have
learned in 3 years of this journey and how these are reflected in the new DVC.</p>
<h3 id="multi-stage-dvc-files" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/1871" target="_blank" rel="nofollow noopener noreferrer">Multi-stage DVC files</a><a href="#multi-stage-dvc-files" aria-label="multi stage dvc files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Our users taught us that ML pipelines evolve much faster than data engineering
pipelines with data processing steps. People need to change the commands of the
pipeline often and it was not easy to do this with the old DVC-files.</p>
<p>In DVC 1.0, the DVC metafile format was changed in three big ways. First,
instead of multiple DVC "stage files" (<code>*.dvc</code>), each project has a single
<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. By default, all stages go in this single YAML file.</p>
<p>Second, we made clear connections between the <code>dvc run</code> command (a helper to
define pipeline stages), and how stages are defined in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>. Many of the
options of <code>dvc run</code> are mirrored in the metafile. We wanted to make it far less
complicated to edit an existing pipeline by making <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> more human
readable and writable.</p>
<p>Third, file and directory hash values are no longer stored in the pipeline
metafile. This approach aligns better with the GitOps paradigms and simplifies
the usage of DVC by tremendously improving metafile human-readability:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
<span class="token key atrule">process</span><span class="token punctuation">:</span>
<span class="token key atrule">cmd</span><span class="token punctuation">:</span> ./process_raw_data raw_data.log users.csv
<span class="token key atrule">deps</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> raw_data.log
<span class="token key atrule">params</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> process_file
<span class="token punctuation">-</span> click_threshold
<span class="token key atrule">outs</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> users.csv
<span class="token key atrule">train</span><span class="token punctuation">:</span>
<span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py
<span class="token key atrule">deps</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> users.csv
<span class="token key atrule">params</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> epochs
<span class="token punctuation">-</span> log_file
<span class="token punctuation">-</span> dropout
<span class="token key atrule">metrics</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> logs.csv
<span class="token punctuation">-</span> <span class="token key atrule">summary.json</span><span class="token punctuation">:</span>
<span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span>
<span class="token key atrule">outs</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> model.pkl</code></pre></div>
<p>All of the hashes have been moved to a special file, <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a>, which is a lot
like the old DVC-file format. DVC uses this lock file to define which data files
need to be restored to the workspace from data remotes (cloud storage) and if a
particular pipeline stage needs to be rerun. In other words, we're separating
the human-readable parts of the pipeline into <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, and the auto-generated
"machine" parts into <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a>.</p>
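<p>The rerun decision described above can be sketched as a hash comparison. This
is a minimal illustration under stated assumptions, not DVC's actual code: the
<code>locked</code> mapping stands in for the hashes recorded in
<code>dvc.lock</code>, and the function names are hypothetical.</p>

```python
import hashlib
from pathlib import Path


def file_md5(path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def stage_is_stale(deps, locked) -> bool:
    """Return True if any dependency differs from its recorded hash.

    `deps` is a list of file paths; `locked` maps each path (as a string)
    to the hash recorded the last time the stage ran.
    """
    for dep in deps:
        recorded = locked.get(str(dep))
        if recorded is None or not Path(dep).exists():
            return True  # never recorded, or the file is gone
        if file_md5(dep) != recorded:
            return True  # contents changed since the last run
    return False
```

<p>Keeping the recorded hashes out of the hand-edited pipeline file means a
person can change a command or dependency list without touching any
machine-generated values.</p>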
<p>Another cool change: the auto-generated part (<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a>) doesn't necessarily
have to be stored in your Git repository. The new run-cache feature eliminates
the need to store the lock file in Git repositories. That brings us to our
next big feature:</p>
<h3 id="run-cache" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/1234" target="_blank" rel="nofollow noopener noreferrer">Run cache</a><a href="#run-cache" aria-label="run cache permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We built DVC with a workflow in mind: one experiment to one commit. Some users
love it, but this approach gets clunky fast for others (like folks who are
grid-searching a hyperparameter space). Making Git commits for each ML
experiment was a requirement with the old DVC, if you wanted to snapshot your
project or pipelines on each experiment. Moving forward, we want to give users
more flexibility to decide how often they want to commit.</p>
<p>We had an insight that data remotes (S3, Azure Blob, SSH etc) can be used
instead of Git for storing the codified meta information, not only data. In DVC
1.0, a special structure is implemented, the run-cache, that preserves the state
(including all the hashes). Basically, all the information that is stored in the
new <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a> file is replicated in the run-cache.</p>
<p>The advantage of the run-cache is that pipeline runs (and output file versions)
are not directly connected to Git commits anymore. The new DVC can store all the
runs in the run-cache, even if they were never committed to Git.</p>
<p>This approach gives DVC a "long memory" of stage runs. If a user tries to
run a stage that was previously run (whether committed to Git or not), then DVC
can return the result from the run-cache without rerunning it. This is
especially useful for a hyperparameter optimization stage, when users return to
a previous set of parameters and don't want to wait for ML retraining.</p>
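<p>Conceptually, this "long memory" behaves like memoization keyed on everything
that determines a stage's result. The sketch below is an in-memory illustration
only; the <code>RunCache</code> class and <code>run_key</code> function are
hypothetical names, and DVC's real run-cache is stored on disk and in data
remotes.</p>

```python
import hashlib
import json


def run_key(cmd, dep_hashes, params) -> str:
    """Derive a stable key from the command, dependency hashes, and parameters."""
    payload = json.dumps(
        {"cmd": cmd, "deps": dep_hashes, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RunCache:
    """Remember stage results so identical runs are not executed twice."""

    def __init__(self):
        self._results = {}

    def run(self, cmd, dep_hashes, params, execute):
        """Return (result, was_cached); call `execute` only on a cache miss."""
        key = run_key(cmd, dep_hashes, params)
        if key in self._results:
            return self._results[key], True
        result = execute()
        self._results[key] = result
        return result, False
```

<p>In a grid search, returning to an earlier parameter set produces the same key
as before, so the stored result comes back without retraining.</p>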
<p>Another benefit of the run-cache is related to CI/CD systems for ML, which is a
holy grail of MLOps. The long memory means users don't have to make auto-commits
in their CI/CD systems; see
<a href="https://stackoverflow.com/questions/61245284/will-you-automate-git-commit-into-ci-cd-pipline-to-save-dvc-run-experiments" target="_blank" rel="nofollow noopener noreferrer">this Stack Overflow question</a>.</p>
<h3 id="plots" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/3409" target="_blank" rel="nofollow noopener noreferrer">Plots</a><a href="#plots" aria-label="plots permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Countless users have asked when we'd support metrics visualizations. It became
clear that metrics and their visualization are an essential part of <em>DataOps</em>,
especially when it comes down to navigation around ML models, datasets and
experiments. Now it's here: DVC 1.0 introduces metrics file visualization
commands, <a href="https://dvc.org/doc/command-reference/plots/diff"><code>dvc plots diff</code></a> and <a href="https://dvc.org/doc/command-reference/plots/show"><code>dvc plots show</code></a>. This is brand-new functionality
in DVC and it's <em>in experimental mode</em> now.</p>
<p>This function is designed not only for visualizing the current state of your
project, but also for comparing plots across your Git history. Users can
visualize how, for example, their model accuracy in the latest commit differs
from another commit (or even multiple commits).</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> <span class="token parameter variable">-d</span> logs.csv HEAD HEAD^ d1e4d848 baseline_march
</span>file:///Users/dmitry/src/plot/logs.csv.html
<span class="token line"><span class="token input">$ </span><span class="token command">open</span> logs.csv.html</span></code></pre></div>
<p><img src="https://dvc.org/2020-05-04/dvc-plots-092248e6898ab510fc3803efb5e22d9f.svg" alt=""></p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> <span class="token parameter variable">-d</span> logs.csv HEAD HEAD^ d1e4d848 baseline_march <span class="token punctuation">\</span>
<span class="token parameter variable">-x</span> loss <span class="token parameter variable">--template</span> scatter
</span>file:///Users/dmitry/src/plot/logs.csv.html
<span class="token line"><span class="token input">$ </span><span class="token command">open</span> logs.csv.html</span></code></pre></div>
<p><img src="https://dvc.org/2020-05-04/dvc-plots-scatter-9cfc6c2078273faa482129d8d1609967.svg" alt=""></p>
<p>DVC plots are powered by the
<a href="https://vega.github.io/vega-lite/" target="_blank" rel="nofollow noopener noreferrer">Vega-Lite graphic library</a>. We picked Vega
because it works at a high level, is compatible with all ML frameworks, and
looks great out of the box. However, you don't have to know Vega to use DVC
plots: we've provided default templates for line graphs, scatterplots, and
confusion matrices, so you can just point DVC plots to your metrics and go.</p>
<h3 id="data-transfer-optimizations" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/3488" target="_blank" rel="nofollow noopener noreferrer">Data transfer optimizations</a><a href="#data-transfer-optimizations" aria-label="data transfer optimizations permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In <em>DataOps</em>, data transfer speed is hugely important. We've done substantial
work to optimize data management commands, like
<a href="https://dvc.org/doc/command-reference/pull#-c"><code>dvc pull / push / status -c / gc -c</code></a>. Now, based on the amount of data to move,
DVC can choose the optimal strategy for traversing your data remote.</p>
<p><a href="https://github.com/iterative/dvc/issues/2147" target="_blank" rel="nofollow noopener noreferrer">Mini-indexes</a> help DVC instantly
check data directories instead of iterating over millions of files. This also
speeds up adding/removing files to/from large directories.</p>
<p>More optimizations are included in the release based on our profiling of
performance bottlenecks. More detailed
<a href="https://gist.github.com/pmrowla/338d9645bd05df966f8aba8366cab308" target="_blank" rel="nofollow noopener noreferrer">benchmark reports</a>
show how many seconds it takes to run specific commands on a directory
containing 2 million images.</p>
<p><img src="https://dvc.org/2020-05-04/benchmarks-fb3909a1a199bbfdfb5b66b689e2ffb0.svg" alt=""></p>
<h3 id="hyperparameter-tracking" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/3393" target="_blank" rel="nofollow noopener noreferrer">Hyperparameter tracking</a><a href="#hyperparameter-tracking" aria-label="hyperparameter tracking permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This feature was actually released in the previous version, DVC 0.93 (see the
<a href="https://dvc.org/doc/command-reference/params" target="_blank" rel="nofollow noopener noreferrer">params docs</a>). However, it is an
important step toward supporting configuration files and ML experiments in a
more holistic way.</p>
<p>The parameters are a special type of dependency in the pipelines. This is the
way of telling DVC that a command depends not on a file (<code>params.yaml</code>) but on a
particular set of values in the file:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-d</span> users.csv <span class="token parameter variable">-o</span> model.pkl <span class="token punctuation">\</span>
<span class="token parameter variable">--params</span> lr,train.epochs,train.layers <span class="token punctuation">\</span>
python train.py</span></code></pre></div>
<p>The <code>params.yaml</code> file is the place where the parameters are stored:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">lr</span><span class="token punctuation">:</span> <span class="token number">0.0041</span>
<span class="token key atrule">train</span><span class="token punctuation">:</span>
  <span class="token key atrule">epochs</span><span class="token punctuation">:</span> <span class="token number">70</span>
  <span class="token key atrule">layers</span><span class="token punctuation">:</span> <span class="token number">9</span>
<span class="token key atrule">process</span><span class="token punctuation">:</span>
  <span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">0.98</span>
  <span class="token key atrule">bow</span><span class="token punctuation">:</span> <span class="token number">15000</span></code></pre></div>
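<p>To make the idea of depending on particular values concrete, here is a small
Python sketch. It mirrors the parameters file as a dictionary and resolves dotted
names such as <code>train.epochs</code>. The function names
<code>resolve_param</code> and <code>select_params</code> are hypothetical and
are not part of DVC; this only illustrates the concept, not how DVC implements
it.</p>

```python
from functools import reduce

# The same values as the params.yaml example, written as a Python dict.
PARAMS = {
    "lr": 0.0041,
    "train": {"epochs": 70, "layers": 9},
    "process": {"thresh": 0.98, "bow": 15000},
}


def resolve_param(params, dotted_key):
    """Follow a dotted key such as 'train.epochs' through nested dicts."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), params)


def select_params(params, dotted_keys):
    """Collect only the values a stage declared as parameter dependencies."""
    return {key: resolve_param(params, key) for key in dotted_keys}


tracked_keys = ["lr", "train.epochs", "train.layers"]
tracked = select_params(PARAMS, tracked_keys)
print(tracked)

# Editing a value the stage did not declare leaves its dependencies unchanged.
edited = dict(PARAMS, process={"thresh": 0.5, "bow": 15000})
print(select_params(edited, tracked_keys) == tracked)
```

<p>In this sketch, changing <code>process.thresh</code> does not alter the tracked
values, which is the point of declaring parameters individually rather than
depending on the whole file.</p>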
<h3 id="stable-releases-cycles" style="position:relative;">Stable release cycles<a href="#stable-releases-cycles" aria-label="stable releases cycles permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Today, many teams use DVC in their daily job for modeling and as part of their
production MLOps automation systems. Stability plays an increasingly important
role.</p>
<p>We've always prioritized agility and speed in our development process. There
have been weeks with two DVC releases! This approach had a ton of benefits in
terms of learning speed and rapid feedback from users.</p>
<p>Now we're seeing signs that it's time to shift gears. Our API has stabilized and
version 1.0 is built with our long-term vision in mind. Our user base has grown
and we have footing with mature teams: teams that are using DVC in
mission-critical systems. That's why we're intentionally going to spend more
time on release testing in the future. It might increase the time between
releases, but the quality of the tool will be more predictable.</p>
<p>Additionally, we've already implemented a benchmark testing framework to track
performance across potential releases: <a href="https://iterative.github.io/dvc-bench/" target="_blank" rel="nofollow noopener noreferrer">https://iterative.github.io/dvc-bench/</a>. On
this website, anyone can see the performance improvements and degradations for
every release candidate:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 605px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/44a8b224d00774bd0be1a55b2c98ad45/39600/dvc-benchmark.png" alt="dvc benchmark" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h3 id="for-more-information-on-the-new-features" style="position:relative;">For more information on the new features…<a href="#for-more-information-on-the-new-features" aria-label="for more information on the new features permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Each of these new features has a story that could fill a separate blog post - so
that's what we'll be doing. We'll be posting more soon.
<a href="https://github.com/pmrowla" target="_blank" rel="nofollow noopener noreferrer">Peter Rowlands</a> will be writing a blog post about
the performance optimization in DVC 1.0,
<a href="https://github.com/pared" target="_blank" rel="nofollow noopener noreferrer">Paweł Redzyński</a> about versioning and visualizing
plots, <a href="https://github.com/skshetry" target="_blank" rel="nofollow noopener noreferrer">Saugat Pachhai</a> about the new DVC file
formats and pipelines, and <a href="https://github.com/efiop" target="_blank" rel="nofollow noopener noreferrer">Ruslan Kuprieiev</a> about
run-cache.</p>
<p>Please stay in touch and subscribe to our newsletter at <a href="http://dvc.org" target="_blank" rel="nofollow noopener noreferrer">http://dvc.org</a>.</p>
<h2 id="thank-you" style="position:relative;">Thank you!<a href="#thank-you" aria-label="thank you permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>It's quite a journey to build an open source project in the ML/AI space. We're
fortunate to have a community of DVC users, contributors and cheerleaders. All
these folks tremendously help us to define, test and develop the project. We've
reached this significant milestone of version 1.0 together and I hope we'll
continue working on DVC and bringing the best practices of DataOps and MLOps to
the ML/AI space.</p>
<p>Thank you again! Please stay in touch on
<a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> and our
<a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>.</p>https://dvc.org/blog/june-20-dvc-heartbeathttps://dvc.org/blog/june-20-dvc-heartbeatMon, 08 Jun 2020 00:00:00 GMT<p>Welcome to the June Heartbeat, our monthly roundup of cool happenings,
<a href="#from-the-community">good reads</a> and
<a href="#coming-up-soon">up-and-coming developments</a> in the DVC community.</p>
<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>At the beginning of May, we
<a href="https://dvc.org/blog/dvc-3-years-and-1-0-release" target="_blank" rel="nofollow noopener noreferrer">pre-released DVC 1.0</a>. Ever
since, we've been putting the final touches on 1.0: wrapping up features, fixing
bugs 🐛, and responding to feedback from intrepid users
<a href="https://dvc.org/doc/install/pre-release" target="_blank" rel="nofollow noopener noreferrer">trying the pre-release</a>. To recap,
here are some of the big features coming:</p>
<ul>
<li>
<p><strong>Plots powered by Vega-Lite.</strong> We're building
<a href="https://dvc.org/doc/command-reference/plots#plots" target="_blank" rel="nofollow noopener noreferrer">functions for visualizing metrics</a>
in your project, as well as comparing metrics across commits. We chose
<a href="https://github.com/vega/vega-lite" target="_blank" rel="nofollow noopener noreferrer">Vega-Lite plots</a> because they're
high-level, compatible with ML projects written in any language, and beautiful
by default.</p>
</li>
<li>
<p><strong>Human readable and writeable pipelines.</strong> We're reworking pipelines so you
can modify dependencies, outputs, metrics, plots, and entire stages easily,
via manual edits to a <code>.yaml</code> pipeline file. This redesign will consolidate
pipeline <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files into a single file (yay, simpler working directory). No
worries for pipeline enthusiasts: DVC 1.0 is backwards compatible, so your
existing projects won't be interrupted.</p>
</li>
<li>
<p><strong>Run cache.</strong> One of the most exciting features is the run-cache, a local
record of pipeline versions that have previously been run and the outputs of
those runs. It can seriously cut down on compute time if you find yourself
repeating pipeline executions. For our CI/CD users, it also offers a way to
save the output of your pipeline, like models or results,
<a href="https://stackoverflow.com/questions/61245284/will-you-automate-git-commit-into-ci-cd-pipline-to-save-dvc-run-experiments" target="_blank" rel="nofollow noopener noreferrer">without auto-commits</a>.</p>
</li>
</ul>
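<p>The run-cache idea can be sketched in a few lines of Python. This is a
conceptual illustration only and not DVC's implementation: the
<code>RunCache</code> class, the <code>run_key</code> function, and the sample
hashes are all hypothetical. The sketch fingerprints a stage by its command and
the content hashes of its dependencies, and skips execution when that
fingerprint has been seen before.</p>

```python
import hashlib


def run_key(command, dep_hashes):
    """Fingerprint a stage run from its command and dependency content hashes."""
    digest = hashlib.sha256()
    digest.update(command.encode() + b"\0")
    for name in sorted(dep_hashes):
        digest.update(name.encode() + b"\0" + dep_hashes[name].encode() + b"\0")
    return digest.hexdigest()


class RunCache:
    """Remember the outputs of previous runs, keyed by their fingerprint."""

    def __init__(self):
        self._outputs = {}
        self.executions = 0  # how many times a stage really ran

    def run(self, command, dep_hashes, execute):
        key = run_key(command, dep_hashes)
        if key not in self._outputs:
            self.executions += 1
            self._outputs[key] = execute()
        return self._outputs[key]


def train():
    # Stand-in for an expensive training step; returns made-up output hashes.
    return {"model.pkl": "hash-of-model"}


cache = RunCache()
first = cache.run("python train.py", {"users.csv": "abc123"}, train)
second = cache.run("python train.py", {"users.csv": "abc123"}, train)  # reused
third = cache.run("python train.py", {"users.csv": "def456"}, train)   # data changed
print(cache.executions)
```

<p>In this sketch, repeating a run with identical inputs returns the stored
outputs without executing anything, while a change to a dependency triggers a
fresh execution.</p>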
<p>DVC 1.0 work has been our top priority this past month, and we are <em>extremely
close</em> to the release. Think 1-2 weeks!</p>
<p>Another neat announcement: DVC moved up on
<a href="https://www.thoughtworks.com/radar/tools" target="_blank" rel="nofollow noopener noreferrer">ThoughtWorks Technology Radar</a>! To
quote ThoughtWorks:</p>
<blockquote>
<p>In 2018 we mentioned DVC in conjunction with the versioning data for
reproducible analytics. Since then it has become a favorite tool for managing
experiments in machine learning (ML) projects. Since it's based on Git, DVC is
a familiar environment for software developers to bring their engineering
practices to ML practice. Because it versions the code that processes data
along with the data itself and tracks stages in a pipeline, it helps bring
order to the modeling activities without interrupting the analysts’ flow.</p>
</blockquote>
<p>And here we are on the radar, in the Trial zone:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 377px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e25ad107f180331a3e3ddca2064d16d5/39600/radar.png" alt="radar" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Blip, blip, blip!</em></p>
<p>We are honored. In fact, this was validating in several ways. We field a lot of
questions about our decision to build around Git, rather than creating a
platform. It's awesome to know our approach is resonating with teams at the
intersection of ML and software development. Thanks, ThoughtWorks!</p>
<p>Last up in company news: you might recall that in early May, we hosted an online
meetup. <a href="http://mribeirodantas.me" target="_blank" rel="nofollow noopener noreferrer">Marcel Ribeiro-Dantas</a> hosted guest talks
from <a href="https://github.com/ehutt" target="_blank" rel="nofollow noopener noreferrer">Elizabeth Hutton</a> and
<a href="https://twitter.com/DeanPlbn" target="_blank" rel="nofollow noopener noreferrer">Dean Pleban</a>. We heard about constructing a new
COVID-19 dataset, using DVC with transformer language models, and building
custom cloud infrastructure for MLOps. There was also a Q&A with the DVC team,
where we fielded audience questions. A video of the meetup is available now, so
check it out if you missed the event.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/19GMtrFykSU?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>As usual, there's a ton of noteworthy action in the DVC community.</p>
<p><a href="https://twitter.com/dhaynes23" target="_blank" rel="nofollow noopener noreferrer">Derek Haynes</a>, MLOps expert and new
<a href="https://dvc.org/blog/dvc-ambassador-program-announcement" target="_blank" rel="nofollow noopener noreferrer">DVC Ambassador</a>,
wrote an excellent overview of using
<a href="https://github.com/features/codespaces/" target="_blank" rel="nofollow noopener noreferrer">GitHub CodeSpaces</a>. CodeSpaces is a
new development environment (currently in beta) that we're eagerly watching. As
Derek shows in his blog, it lets you have a Jupyter Notebook experience without
sacrificing development standards: he uses
<a href="https://docs.whisk-ml.org/en/latest/" target="_blank" rel="nofollow noopener noreferrer">whisk</a> to structure the project and
manage Python package dependencies, and DVC to version the model training
pipeline.</p>
<p>This use case is telling in the
<a href="https://towardsdatascience.com/the-case-against-the-jupyter-notebook-d4da17e97243" target="_blank" rel="nofollow noopener noreferrer">battle over Jupyter notebooks</a>:
we might just be able to have both a notebook <em>and</em> mature project management.
Give Derek's blog a read and tell us what you think.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://dlite.cc/2020/05/26/github-codespaces-machine-learning.html" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">GitHub Codespaces for Machine Learning</h4>
<div class="elp-description">With Codespaces, contributors can spin up a ready-to-go GitHub project-specific dev environment in the cloud. In this post, I’ll show how to give potential contributors a graceful start by configuring Codespaces for an ML project.</div>
<div class="elp-link">dlite.cc</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-06-08/derek_haynes-c4995bfd3020af81632b8d434220c631.jpg" alt="GitHub Codespaces for Machine Learning">
</div>
</a>
</section>
<p></p>
<p>DVC Ambassador Marcel gave a tutorial about DVC to a bioinformatics student
group, and then an even bigger talk at the Federal University of Rio Grande do
Norte. His talk focused on how to use DVC in the context of scientific
reproducibility, specifically with large biological datasets, which are often
transformed and processed several times before ML models are fit. In my
experience, Git-flow is severely underutilized in life sciences research, so
it's exciting to see Marcel's ideas getting a big audience.</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Interested in Data Science? Next Friday at 2 pm we will have a talk on one of the newest tools in the field, DVC - Data Version Control!!! Don't miss this opportunity. <a href="https://twitter.com/ufrnbr">@ufrnbr</a> <a href="https://twitter.com/PropesqUFRN">@PropesqUFRN</a> <a href="https://t.co/AmXxz7ioVG">pic.twitter.com/AmXxz7ioVG</a></p>&mdash; ppgeecufrn (@ppgeecufrn) <a href="https://twitter.com/ppgeecufrn/status/1263260554443005954">May 21, 2020</a></blockquote>
<p>Also, Marcel is the first author of a new scientific paper about mobility data
across 131 countries during the COVID-19 pandemic. The preprocessing pipeline is
versioned with DVC. We don't know how Marcel gets this much done.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.sciencedirect.com/science/article/pii/S2352340920305928" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Dataset for country profile and mobility analysis in the assessment of COVID-19 pandemic</h4>
<div class="elp-description">M. Ribeiro-Dantas, G. Alves, R.B. Gomes, L.C.T. Bezerra, L. Lima and I. Silva</div>
<div class="elp-link">sciencedirect.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-06-08/data_in_brief_logo-2b4c63ad516deb095fbb38327c04e53d.jpeg" alt="Dataset for country profile and mobility analysis in the assessment of COVID-19 pandemic">
</div>
</a>
</section>
<p></p>
<p>Also just released is a scientific paper by Christoph Jansen et al. about a
framework for computational reproducibility in the life sciences that integrates
DVC. The framework is called
<a href="https://github.com/curious-containers/curious-containers" target="_blank" rel="nofollow noopener noreferrer">Curious Containers</a>,
and it is definitely worth checking out for biomedical researchers interested in
deep learning.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.sciencedirect.com/science/article/abs/pii/S0167739X19318096" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Curious Containers: A framework for computational reproducibility in life sciences with support for Deep Learning applications</h4>
<div class="elp-description">C. Jansen, J. Annuscheit, B. Schilling, K. Strohmenger, M. Whitt, F. Bartusch, C. Herta, P. Hufnagl, and D. Krefting</div>
<div class="elp-link">sciencedirect.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-06-08/fgcs_cover-77a45f0e89b711a9797ac2137c86e70b.jpg" alt="Curious Containers: A framework for computational reproducibility in life sciences with support for Deep Learning applications">
</div>
</a>
</section>
<p></p>
<p>In other work of vital interest to the good of humanity, this month has seen
some awesome applications of the
<a href="https://dvc.org/blog/a-public-reddit-dataset" target="_blank" rel="nofollow noopener noreferrer">public Reddit dataset we released in February</a>.
The dataset is designed for an NLP task of mighty importance: will Redditors
vote that the poster is an asshole, or not?</p>
<p>Daniele Gentile beat our benchmark classifier (62% accuracy, but not bad for
logistic regression!) with Doc2Vec embeddings and a 500-neuron network. He got
71% accuracy on held-out data. Nice! His blog is a fun read, and the code is
included if you want to follow along.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/@danielegentili/artificial-intelligence-confirms-you-are-an-a-hole-e8eef354dc2" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Artificial Intelligence confirms you are an a**hole</h4>
<div class="elp-description">Q-LO is a small artificial brain that can determine if you are the a**hole or not in a situation from its description.</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-06-08/medium_logo-45140ce1eb5fe8d0caed749229873cca.png" alt="Artificial Intelligence confirms you are an a**hole">
</div>
</a>
</section>
<p></p>
<p>Elsewhere on the internet, data scientist Dan Cassin delivered this beautiful
tweet:</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Used a dataset from <a href="https://t.co/6yDX1A9Rga">https://t.co/6yDX1A9Rga</a> on <a href="https://twitter.com/Reddit">@reddit</a>'s AITA, used NLTK for processing, TFIDF, then UMAP, and the result is the coolest, but most unhelpful graph I've ever made. <a href="https://twitter.com/matplotlib">@matplotlib</a> <a href="https://t.co/fYpuvwTIYE">pic.twitter.com/fYpuvwTIYE</a></p>— Dan Cassin (@Dan_Cassin) <a href="https://twitter.com/Dan_Cassin/status/1256999648901787648">May 3, 2020</a></blockquote>
<p>Last, I want to point you to two other excellent blogs.
<a href="https://github.com/curiousily" target="_blank" rel="nofollow noopener noreferrer">Venelin Valkov</a> released a blog,
<a href="https://www.curiousily.com/posts/reproducible-machine-learning-and-experiment-tracking-pipiline-with-python-and-dvc/" target="_blank" rel="nofollow noopener noreferrer">Reproducible machine learning and experiment tracking pipeline with Python and DVC</a>,
that contains not only a detailed sample project but a livecoding video!</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/6_kK6wRtzhk?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p><a href="https://www.linkedin.com/in/matthewmcateer0/" target="_blank" rel="nofollow noopener noreferrer">Matthew McAteer</a> revisited the
famous 2015
<a href="https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf" target="_blank" rel="nofollow noopener noreferrer">Hidden Technical Debt in Machine Learning Systems</a>
paper to ask which recommendations still work five years later. It's pretty
great:
<a href="https://matthewmcateer.me/blog/machine-learning-technical-debt/" target="_blank" rel="nofollow noopener noreferrer">please read it</a>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 279.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/53bf1aaa1578f0c5f73792de45c56435/4e10f/spongebob.png" alt="spongebob" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Meme by Matthew McAteer. Click
to enlarge.</em></p>
<h2 id="coming-up-soon" style="position:relative;">Coming up soon<a href="#coming-up-soon" aria-label="coming up soon permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>There are a couple of events to look forward to in the next few weeks. I'll be
speaking at two conferences. First, I'll talk at
<a href="https://mlopsworld.com/program/" target="_blank" rel="nofollow noopener noreferrer">MLOps World</a> about CI/CD and ML. Next, I'm
<a href="https://computationalaudiology.com/the-critical-role-of-computing-infrastructure-in-computational-audiology/" target="_blank" rel="nofollow noopener noreferrer">organizing a workshop</a>
at the Virtual Conference on Computational Audiology. To get ready, I'm
gathering resources about good computing practices for scientists and biomedical
research labs, and
<a href="https://github.com/andronovhopf/Lab_Computing_Resources" target="_blank" rel="nofollow noopener noreferrer">contributions are welcome</a>.</p>
<p>Another talk on our radar is at EuroPython 2020. Engineer
<a href="https://ep2020.europython.eu/talks/CXG7TcM-automating-machine-learning-workflow-with-dvc/" target="_blank" rel="nofollow noopener noreferrer">Hongjoo Lee will be talking about building a CI/CD workflow for ML with DVC</a>,
and we're very interested to learn about their approach.</p>
<p>Lastly, <a href="http://ml-repa.ru/" target="_blank" rel="nofollow noopener noreferrer">ML REPA</a> leader and new DVC Ambassador
<a href="https://twitter.com/mnrozhkov" target="_blank" rel="nofollow noopener noreferrer">Mikhail Rozhkov</a> is working on a Udemy course
about DVC. Look for more updates this summer!</p>
<p>Thanks for reading this month. As always, we're proud of the ways our community
works for better, more rigorous ML.</p>https://dvc.org/blog/may-20-community-gemshttps://dvc.org/blog/may-20-community-gemsTue, 26 May 2020 00:00:00 GMT<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Here are some Q&A's from our Discord channel that we think are worth sharing.</p>
<h3 id="q-how-do-i-completely-delete-a-file-from-dvc" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/563406153334128681/710546561498873886" target="_blank" rel="nofollow noopener noreferrer">How do I completely delete a file from DVC?</a><a href="#q-how-do-i-completely-delete-a-file-from-dvc" aria-label="q how do i completely delete a file from dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>To stop tracking a file with DVC, you can simply delete the file and its
corresponding <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file (if there is one) from your project. But what if you
want to entirely erase a file from DVC?</p>
<p>After deleting the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file, you'll usually want to
<a href="https://dvc.org/doc/command-reference/gc#gc" target="_blank" rel="nofollow noopener noreferrer">clear your DVC cache</a>. Ordinarily,
that's done with <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a>. However, if there's any chance the file you wish to
remove might be referenced by another commit (even under a different name), be
sure to use the right flag: <a href="https://dvc.org/doc/command-reference/gc#--all-commits"><code>dvc gc --all-commits</code></a>.</p>
<p>If you want to remove the data for a single <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file without doing a full cache cleanup, look
into the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file and note the <code>md5</code> field inside. Then use this value to
identify the corresponding file in your <code>.dvc/cache</code> and delete it. For example,
if your target file has <code>md5</code>: 123456, the corresponding file in your cache will
be <code>.dvc/cache/12/3456</code>: the first two characters of the hash name the
subdirectory, and the remaining characters name the file.</p>
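<p>The mapping from hash to cache location can be written as a short Python
helper. This is a sketch based only on the example above; the function name
<code>cache_path</code> is hypothetical, and the layout is the one described in
this answer, so check your own <code>.dvc/cache</code> before deleting
anything.</p>

```python
from pathlib import PurePosixPath


def cache_path(md5, cache_dir=".dvc/cache"):
    """Return the cache location for a hash, per the layout described above.

    The first two characters of the hash form the subdirectory name and
    the remaining characters form the file name.
    """
    if len(md5) < 3:
        raise ValueError("hash is too short to split into directory and file")
    return PurePosixPath(cache_dir) / md5[:2] / md5[2:]


print(cache_path("123456"))  # .dvc/cache/12/3456
```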
<p>There's one last case worth mentioning: what if you're deleting a file inside a
DVC-tracked folder? For example, say you've previously run</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">dvc add data_dir</code></pre></div>
<p>and now want to remove a single file (say, <code>image_1.png</code>) from <code>data_dir</code>. When
DVC starts tracking a directory, it creates a corresponding <code>.dir</code> file inside
<code>.dvc/cache</code> that lists every file and subfolder, as well as an <code>md5</code> for each,
in a JSON format. You'll want to locate this <code>.dir</code> file in the cache, and then
find the entry corresponding to <code>image_1.png</code>. It'll give the <code>md5</code> for
<code>image_1.png</code>. Finally, go back to <code>.dvc/cache</code>, identify the file corresponding
to that <code>md5</code>, and delete it. For detailed instructions about <code>.dir</code> files,
where to find them and how they're used,
<a href="https://dvc.org/doc/user-guide/dvc-internals#structure-of-cache-directory" target="_blank" rel="nofollow noopener noreferrer">see our docs about the structure of the cache</a>.</p>
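<p>The lookup described above can be sketched in Python. The sketch assumes the
<code>.dir</code> file is a JSON list of entries with <code>md5</code> and
<code>relpath</code> fields; treat those field names, the helper
<code>find_md5</code>, and the sample hashes as assumptions and confirm them
against a real <code>.dir</code> file in your cache.</p>

```python
import json


def find_md5(dir_listing_json, relpath):
    """Return the md5 recorded for relpath in a .dir listing, or None.

    Assumes each entry is an object with 'md5' and 'relpath' keys.
    """
    for entry in json.loads(dir_listing_json):
        if entry["relpath"] == relpath:
            return entry["md5"]
    return None


# Made-up listing standing in for the contents of a .dir file.
listing = json.dumps(
    [
        {"md5": "aaa111", "relpath": "image_0.png"},
        {"md5": "bbb222", "relpath": "image_1.png"},
    ]
)

md5 = find_md5(listing, "image_1.png")
# Split the hash the same way as in the earlier example (12/3456).
target = f".dvc/cache/{md5[:2]}/{md5[2:]}"
print(target)
```

<p>The printed path is the cache file you would delete to remove
<code>image_1.png</code> from the cache.</p>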
<p>Having said all this… please know that in the future, we plan to support a
function like <code>git rm</code> that will allow easier deletes from DVC!</p>
<h3 id="q-is-it-safe-to-add-a-custom-file-to-my-dvc-remote" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/563406153334128681/707551737745244230" target="_blank" rel="nofollow noopener noreferrer">Is it safe to add a custom file to my DVC remote?</a><a href="#q-is-it-safe-to-add-a-custom-file-to-my-dvc-remote" aria-label="q is it safe to add a custom file to my dvc remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Definitely. Some people add additional files to their DVC remote, like a README
to explain to teammates what the folder is being used for. Having an additional
file in the remote that isn't part of DVC tracking won't pose any issues. You
would only encounter problems if you were manually modifying or deleting
contents of the remote managed by DVC.</p>
<h3 id="q-are-there-limits-to-how-many-files-dvc-can-handle-my-dataset-contains-100000-files" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/563406153334128681/706538115048669274" target="_blank" rel="nofollow noopener noreferrer">Are there limits to how many files DVC can handle? My dataset contains ~100,000 files.</a><a href="#q-are-there-limits-to-how-many-files-dvc-can-handle-my-dataset-contains-100000-files" aria-label="q are there limits to how many files dvc can handle my dataset contains 100000 files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We ourselves have stored datasets containing up to 2 million files, so 100,000
is certainly feasible. Of course, the larger your dataset, the more time data
transfer operations will take. Luckily,
<a href="https://dvc.org/blog/dvc-3-years-and-1-0-release#data-transfer-optimizations" target="_blank" rel="nofollow noopener noreferrer">DVC 1.0 contains several data transfer optimizations</a>
that substantially reduce the time needed to run <a href="https://dvc.org/doc/command-reference/pull#-c"><code>dvc pull / push / status -c / gc -c</code></a>
for very large datasets.</p>
<h3 id="q-two-developers-on-my-team-are-doing-dvc-push-to-the-same-remote-should-they-dvc-pull-first" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/563406153334128681/704211629075857468" target="_blank" rel="nofollow noopener noreferrer">Two developers on my team are doing <code>dvc push</code> to the same remote. Should they <code>dvc pull</code> first?</a><a href="#q-two-developers-on-my-team-are-doing-dvc-push-to-the-same-remote-should-they-dvc-pull-first" aria-label="q two developers on my team are doing dvc push to the same remote should they dvc pull first permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>It's safe to push simultaneously, no <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> needed. While some teams might
be in the habit of frequently pulling, like in Git flow, there is less risk of
"merge conflicts" in DVC. That's because DVC remotes store files indexed by
<code>md5</code>s, so there's usually a very low probability of a collision (if two
developers have two different versions of a file, they'll be stored as two
separate files in the DVC remote, so no merge conflicts).</p>
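<p>To see why content-indexed storage avoids conflicts, here is a toy model in Python. A plain dictionary stands in for the remote, and the <code>push</code> function is a hypothetical stand-in, not DVC's implementation: two different file versions land under two different keys, and pushing the same content twice changes nothing.</p>

```python
import hashlib


def push(remote: dict, content: bytes) -> str:
    """Store content under its md5; pushing identical content twice is a no-op."""
    key = hashlib.md5(content).hexdigest()
    remote.setdefault(key, content)
    return key


remote = {}  # stands in for the shared DVC remote
key_a = push(remote, b"model trained by developer A")
key_b = push(remote, b"model trained by developer B")
key_a_again = push(remote, b"model trained by developer A")

print(key_a != key_b)        # True: different versions get different keys
print(key_a == key_a_again)  # True: same content maps to the same key
print(len(remote))           # 2
```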
<h3 id="q-what-are-tmp-files-in-my-dvc-remote" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/563406153334128681/698163554095857745" target="_blank" rel="nofollow noopener noreferrer">What are <code>*.tmp</code> files in my DVC remote?</a><a href="#q-what-are-tmp-files-in-my-dvc-remote" aria-label="q what are tmp files in my dvc remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Inside your DVC remote, you might see <code>.tmp</code> files from incomplete uploads. This
can happen if a user killed a process like <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>. You can safely remove
them; for example, if you're using an S3 bucket, <code>aws s3 rm ... *.tmp</code> will do
the trick.</p>
<p>One caveat: before you delete, make sure no one is actively running <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>.</p>
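<p>If you would rather review the leftovers before deleting them, a small filter over a listing of your remote is enough. The sketch below is storage-agnostic: it assumes you already have the object keys as a list of strings, and the key names shown are made up. With the AWS CLI specifically, pattern matching is typically done with <code>--recursive</code> plus <code>--exclude</code> and <code>--include</code> filters rather than a shell-style wildcard.</p>

```python
def leftover_tmp_keys(keys):
    """Return the object keys that end in .tmp, i.e. candidates for cleanup."""
    return [key for key in keys if key.endswith(".tmp")]


# Hypothetical listing of a remote; the key names are invented.
listing = ["12/3456", "ab/cdef.tmp", "README", "9f/00aa.tmp"]
print(leftover_tmp_keys(listing))  # ['ab/cdef.tmp', '9f/00aa.tmp']
```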
<h3 id="q-im-using-a-google-cloud-platform-gcp-bucket-as-a-dvc-remote-and-getting-an-error-any-ideas" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/485596304961962003/705131622537756702" target="_blank" rel="nofollow noopener noreferrer">I'm using a Google Cloud Platform (GCP) bucket as a DVC remote and getting an error. Any ideas?</a><a href="#q-im-using-a-google-cloud-platform-gcp-bucket-as-a-dvc-remote-and-getting-an-error-any-ideas" aria-label="q im using a google cloud platform gcp bucket as a dvc remote and getting an error any ideas permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If you're getting the error,</p>
<div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">ERROR: unexpected error - ('invalid_grant: Bad Request', '{\n "error": "invalid_grant",\n "error_description": "Bad Request"\n}')</code></pre></div>
<p>something is going wrong with your GCP authentication! A few things to check:
first, follow
<a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">our docs</a>
on using <a href="https://dvc.org/doc/command-reference/remote/add"><code>dvc remote add</code></a> to set up a Google Cloud bucket as your remote. Note that before DVC
can use this type of remote, you have to configure your credentials through the
GCP CLI
(<a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">see docs here</a>).</p>
<p>If you're still getting an error, DVC probably can't find the <code>.json</code>
credentials file for your GCP bucket. Try authenticating using
<code>gcloud beta auth application-default login</code>. This command obtains your access
credentials and places them in a <code>.json</code> in your local workspace.</p>
<h3 id="q-im-working-on-several-projects-that-all-need-involve-the-same-saved-model-one-project-trains-a-model-and-pushes-it-to-cloud-storage-with-dvc-push-and-another-takes-the-model-out-of-cloud-storage-for-use-whats-the-best-practice-for-doing-this-with-dvc" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/485596304961962003/708318821253120040" target="_blank" rel="nofollow noopener noreferrer">I'm working on several projects that all need involve the same saved model. One project trains a model and pushes it to cloud storage with <code>dvc push</code>, and another takes the model out of cloud storage for use. What's the best practice for doing this with DVC?</a><a href="#q-im-working-on-several-projects-that-all-need-involve-the-same-saved-model-one-project-trains-a-model-and-pushes-it-to-cloud-storage-with-dvc-push-and-another-takes-the-model-out-of-cloud-storage-for-use-whats-the-best-practice-for-doing-this-with-dvc" aria-label="q im working on several projects that all need involve the same saved model one project trains a model and pushes it to cloud storage with dvc push and another takes the model out of cloud storage for use whats the best practice for doing this with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>One of DVC's goals is to make it easy to move models and datasets in and out of
cloud storage. We had this in mind when we designed the <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> command:
it lets you reuse artifacts from one project in another. And you can quickly
synchronize an artifact, like a model or dataset, with its latest version using
<a href="https://dvc.org/doc/command-reference/update"><code>dvc update</code></a>. Check out our
<a href="https://dvc.org/doc/command-reference/import" target="_blank" rel="nofollow noopener noreferrer">docs about <code>import</code></a>, and also
our <a href="https://dvc.org/doc/use-cases/data-registries" target="_blank" rel="nofollow noopener noreferrer">data registry use case</a> for
an example of sharing artifacts across projects.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 690.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/da720a4b7b9b33a811b2b4fb6b176e86/39600/data-registry.png" alt="data registry" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Using DVC for sharing
artifacts like datasets and models across projects and teammates.</em></p>https://dvc.org/blog/may-20-dvc-heartbeathttps://dvc.org/blog/may-20-dvc-heartbeatThu, 14 May 2020 00:00:00 GMT<p>Welcome to the May Heartbeat, our <a href="#news">monthly roundup of cool happenings</a>,
<a href="#new-releases">new releases</a>, <a href="#from-the-community">good reads</a> and other
noteworthy developments in the DVC community.</p>
<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><strong>DVC turns 3.</strong> On May 4th, we celebrated DVC's third birthday! Fearless leader
Dmitry Petrov
<a href="https://dvc.org/blog/dvc-3-years-and-1-0-release" target="_blank" rel="nofollow noopener noreferrer">wrote a retrospective</a> about
how the team has grown and what we've learned from our users, contributors, and
colleagues. Thanks to everyone who celebrated with us!</p>
<p><strong>Ambassador program launched.</strong> DVC has just kicked off our ambassador program
with the help of our first ambassador,
<a href="https://twitter.com/messages/40813700-894970070358564864" target="_blank" rel="nofollow noopener noreferrer">Marcel Ribeiro-Dantas</a>.
Marcel is an early-stage researcher at the Institut Curie, a veteran
<a href="https://fedoraproject.org/wiki/User:Mribeirodantas" target="_blank" rel="nofollow noopener noreferrer">ambassador of the Fedora Project</a>,
and a <a href="http://mribeirodantas.me/" target="_blank" rel="nofollow noopener noreferrer">data science blogger</a>. Becoming an ambassador
is a way for folks who are passionate about contributing to the DVC community to
get recognized for their efforts. It's also a way for us to help volunteers with
financial support for meetups and travel, as well as chances to work more
closely with our team. The program is ideal for anyone who already likes
blogging about DVC, contributing code, and hosting get-togethers (virtual or
otherwise), but especially advanced students and early career data scientists
and engineers!
<a href="https://dvc.org/blog/dvc-ambassador-program-announcement" target="_blank" rel="nofollow noopener noreferrer">Learn more about it here.</a></p>
<p><strong>DVC is part of 2020 Google Season of Docs.</strong> Another way to get involved with
DVC is through
<a href="https://developers.google.com/season-of-docs" target="_blank" rel="nofollow noopener noreferrer">Google Season of Docs</a>, a program
we're participating in for the second year in a row. This program is for
technical writers to get paid experience working with the DVC team in fall 2020.
Right now, we're accepting proposals from interested writers.
<a href="https://dvc.org/blog/gsod-ideas-2020" target="_blank" rel="nofollow noopener noreferrer">Find out more here.</a></p>
<p><strong>5000 GitHub Stars.</strong> It finally happened: we passed 5,000 stars
<a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">on our GitHub repo!</a></p>
<p><img src="https://media.giphy.com/media/igWE67cPgTrWwXq4Nz/giphy.gif" alt="Animated GIF"></p>
<h2 id="new-releases" style="position:relative;">New releases<a href="#new-releases" aria-label="new releases permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Coinciding with DVC's 3rd birthday, we shared a pre-release of DVC 1.0. The
release is expected in a few weeks, but you can experiment with 1.0 now (and
open <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">tickets in our project repo</a> if you find
a bug 🐛). Some major new features include:</p>
<ul>
<li>
<p><strong>Run cache</strong>, a cache of pipelines you've reproduced on your local workspace.
If you re-run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> on a pipeline version that's already been executed,
run cache will save you compute time by returning the cached result.</p>
</li>
<li>
<p><strong>Multi-stage DVC files</strong>. Users reported that their DVC pipelines changed a
lot, so we've made pipeline <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files more human-readable and editable for
fast redesigns.</p>
</li>
<li>
<p><strong>Plots</strong>. We've got plots powered by
<a href="https://vega.github.io/vega-lite/" target="_blank" rel="nofollow noopener noreferrer">Vega-Lite</a> for making beautiful
visualizations comparing model performance across commits! Developer Paweł
Redzyński is hard at work:</p>
</li>
</ul>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Visual aids come to DVC 1.0, with my little help. <a href="https://t.co/Fd1qVr7rHb">pic.twitter.com/Fd1qVr7rHb</a></p>— Pablito (@Paffciu1) <a href="https://twitter.com/Paffciu1/status/1260119918525194241">May 12, 2020</a></blockquote>
<p>You can read more about the big updates coming in DVC 1.0
<a href="https://dvc.org/blog/dvc-3-years-and-1-0-release#dvc-10-is-the-result-of-3-years-of-learning" target="_blank" rel="nofollow noopener noreferrer">in our birthday blog</a>.</p>
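<p>To make the run cache idea from the feature list concrete, here is a toy sketch in Python. It is not how DVC implements the feature: the function names, the key scheme (a hash of the command plus dependency hashes), and the sample values are all assumptions chosen for illustration. The point is only that an already-seen combination of command and inputs returns the stored result instead of running again.</p>

```python
import hashlib


def stage_key(command: str, dep_hashes: dict) -> str:
    """Hash the command plus the sorted dependency hashes into one lookup key."""
    parts = [command] + [f"{name}={dep_hashes[name]}" for name in sorted(dep_hashes)]
    return hashlib.sha256("\n".join(parts).encode()).hexdigest()


def repro(command, dep_hashes, run_cache, execute):
    """Return (result, was_cached); only call execute on a cache miss."""
    key = stage_key(command, dep_hashes)
    if key in run_cache:
        return run_cache[key], True
    run_cache[key] = execute()
    return run_cache[key], False


run_cache = {}
deps = {"data.csv": "aaaa1111"}  # hypothetical dependency hash
first = repro("python train.py", deps, run_cache, lambda: "model-v1")
second = repro("python train.py", deps, run_cache, lambda: "model-v2")
print(first)   # ('model-v1', False)
print(second)  # ('model-v1', True): the second run never executes
```

<p>Changing either the command or any dependency hash produces a new key, so the stage would run again.</p>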
<h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Developers weren't the only ones hustling this month…</p>
<p><strong>First ever virtual DVC Meetup.</strong> Marcel, our new ambassador, led an
initiative to
<a href="https://tulu.la/events/dvc-virtual-meetup-2020-00032c" target="_blank" rel="nofollow noopener noreferrer">organize a virtual meetup</a>!
Marcel shared his latest scientific work about creating a
<a href="https://www.sciencedirect.com/science/article/pii/S2352340920305928?via%3Dihub" target="_blank" rel="nofollow noopener noreferrer">new comprehensive dataset about mobility</a>
during the COVID-19 pandemic and then passed off the mic to our two guest
speakers. Data scientist <a href="https://github.com/ehutt" target="_blank" rel="nofollow noopener noreferrer">Elizabeth Hutton</a> spoke about how
she was building a workflow for her NLP team with DVC, and
<a href="https://dagshub.com/" target="_blank" rel="nofollow noopener noreferrer">DAGsHub</a> co-founder
<a href="https://twitter.com/DeanPlbn" target="_blank" rel="nofollow noopener noreferrer">Dean Pleban</a> shared his custom remote file system
setup for modeling Reddit post popularity. It was quite well-attended for our
first ever virtual hangout: we logged 40 individual logins to the meetup with
more than 30 people staying the whole time! A video of the meetup is
<a href="https://tulu.la/events/dvc-virtual-meetup-2020-00032c" target="_blank" rel="nofollow noopener noreferrer">on the event page</a>, so
you can still check out the talks and discussion we enjoyed.</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">It was awesome speaking at the <a href="https://twitter.com/DVCorg">@DVCorg</a> meetup about <a href="https://twitter.com/Reddit">@reddit</a> post popularity prediction and DVC <a href="https://twitter.com/hashtag/remote?src=hash&ref_src=twsrc%5Etfw">#remote</a> working file systems. Also a lot of <a href="https://twitter.com/hashtag/DAGs?src=hash&ref_src=twsrc%5Etfw">#DAGs</a>. <a href="https://t.co/5WKTlIEvHK">pic.twitter.com/5WKTlIEvHK</a></p>— Dean 🐶 (@DeanPlbn) <a href="https://twitter.com/DeanPlbn/status/1258475031530790916">May 7, 2020</a></blockquote>
<p><strong>Some blogs we like.</strong> As usual, there's a lot of share-worthy writing in the
data science and MLOps space:</p>
<ul>
<li><a href="https://twitter.com/ixek" target="_blank" rel="nofollow noopener noreferrer">Tania Allard</a> wrote an intensely readable,
extremely sharp guide to practical steps anyone can take to improve the
reproducibility of their ML projects. She really nails the complexity of the
workflow and the importance of decoupling code and data (which we obviously
agree with very much 😏). The graphics are also 💯- Tania is a developer
advocate to follow.</li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://dev.to/azure/10-top-tips-for-reproducible-machine-learning-36g0" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">10 top tips for reproducible Machine Learning</h4>
<div class="elp-description">The one where you get some advice to make your workflows more reproducible</div>
<div class="elp-link">dev.to</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-05-14/dev_logo-4362b64c557ebe87d5d8d21ad965ffaf.png" alt="10 top tips for reproducible Machine Learning">
</div>
</a>
</section>
<p></p>
<ul>
<li><a href="https://medium.com/@vimarshk" target="_blank" rel="nofollow noopener noreferrer">Vimarsh Karbhari</a> blogged about how teams that
work with data can strategize better about versioning their data and analysis
pipelines. Rather than offering step-by-step practical recommendations,
Vimarsh stresses a deliberate and careful approach. He emphasizes how the
team's choices should depend on factors like project maturity and how much
flexibility is going to be needed. It's a solid overview of how to begin
thinking about MLOps at a high level.</li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/acing-ai/ml-ops-data-science-version-control-5935c49d1b76" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">ML Ops: Data Science Version Control</h4>
<div class="elp-description">Data versioning primer for model, data and code.</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-05-14/acing_ai-3efe392da9b0c56e9f6f3bbae8e08580.png" alt="ML Ops: Data Science Version Control">
</div>
</a>
</section>
<p></p>
<ul>
<li>Over at <a href="https://www.autoregressed.com/" target="_blank" rel="nofollow noopener noreferrer">AutoRegressed</a>, Jack Pitts shared a
thorough tutorial about using <a href="https://pypi.org/project/pipenv/" target="_blank" rel="nofollow noopener noreferrer">Pipenv</a>, DVC
and Git together. Together, the trio manages dependencies and versions the
working environment, source code, dataset <em>and</em> trained models. It's not only
a cool use case, but a very clear step-by-step explanation that should be easy
to try at home. Stay till the end for a neat trick about deploying a model as
a web service with Pipenv and DVC.</li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://www.autoregressed.com/blog/pipenv-and-dvc-reproducibility-in-data-science" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Pipenv and DVC: Reproducibility in Data Science</h4>
<div class="elp-description">Without standards and tools to easily reproduce models, Data Science teams can become bogged down in technical debt that will make it difficult to deploy and iterate on models. </div>
<div class="elp-link">autoregressed.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-05-14/ar_logo-5df1c108110bd9379c9bfe078c26fb46.jpg" alt="Pipenv and DVC: Reproducibility in Data Science">
</div>
</a>
</section>
<p></p>
<h2 id="nice-tweets" style="position:relative;">Nice tweets<a href="#nice-tweets" aria-label="nice tweets permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Last, here are some of our favorite tweets to read this past month:</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Data version control from <a href="https://twitter.com/DVCorg">@DVCorg</a> is one of the best new tools I've used in a while. Moving data via the cloud is just a push or pull command away. <br><br>Recommend for anyone who works on multiple machines or shares data with collaborators</p>— Liam Brannigan (@braaannigan) <a href="https://twitter.com/braaannigan/status/1257918525345234949">May 6, 2020</a></blockquote>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Getting around to learning <a href="https://twitter.com/DVCorg">@DVCorg</a>, and loving it so far. Versioning data with git-style semantics gives you a lot of functionality with surprisingly little cognitive overhead.</p>— Tim Garvin (@tcgarvin) <a href="https://twitter.com/tcgarvin/status/1258855168436813826">May 8, 2020</a></blockquote>
<p><em>Thank you, thank you very much.</em></p>
<p><img src="https://media.giphy.com/media/gJ2sDSKAQHUCIYUFhx/giphy.gif" alt="Thank You Very Much GIF by The Wiggles"></p>
<p>As always, we want to hear what you're making with DVC and what you're reading.
Tell us in the blog comments, and get in touch on
<a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> and
our <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>. Happy coding!</p>https://dvc.org/blog/dvc-ambassador-program-announcementhttps://dvc.org/blog/dvc-ambassador-program-announcementFri, 08 May 2020 00:00:00 GMT<p>DVC's software can be everywhere, but its developers can’t - that’s why
ambassadors, folks who do outreach and community building around projects they
love, are a key part of the open source community. DVC is starting an ambassador
program to help people who are passionate about our mission get involved.</p>
<p>As the first DVC ambassador, and a
<a href="https://fedoraproject.org/wiki/User:Mribeirodantas" target="_blank" rel="nofollow noopener noreferrer">Fedora ambassador</a> before
that, I can tell you a bit about the role. As a representative of open source
projects, I've participated in lots of events, made friends, and traveled. Every
single time I’ve contributed, I got this nice feeling that it was all worth it.
I believe that if you agree with the core values of the project, a great
relationship lies ahead :).</p>
<p>So what are the core values of DVC, exactly? DVC is founded on the principle of
engineering solutions for making data science and machine learning rigorous and
reproducible. If this matters to you, too, you might be a good fit for our
ambassador program!</p>
<p>As an ambassador, you’ll act as a bridge between DVC and your community. There
are lots of ways to do this, big and small. For example:</p>
<ul>
<li>Write a blog post talking about how you use DVC in your projects</li>
<li>What about creating a network of DVC users and data scientists in your town?
Even though we’re self-isolating now, you can still organize online meetups.
<a href="https://tulu.la/events/dvc-virtual-meetup-2020-00032c" target="_blank" rel="nofollow noopener noreferrer">We already did one!</a>
We help cover costs to organize meetups.</li>
<li>Do you want to talk about DVC at your office, or at a conference? We help
speakers develop talks, and we have some discretionary funds for travel on a
case-by-case basis.</li>
<li>Want to develop a feature for DVC? We welcome contributions to the code base,
even if it’s your first pull request ever.</li>
</ul>
<p>Being an ambassador means getting closer to the team in charge of DVC, but at
the same time, it means going farther to reach people outside the organization,
including people who don’t know about DVC yet, people who need some help getting
started, and people who are already excited about our mission and want to find
meaningful ways to pitch in.</p>
<h2 id="about-iterative-and-dvc" style="position:relative;">About Iterative and DVC<a href="#about-iterative-and-dvc" aria-label="about iterative and dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>DVC got started in 2017 as a personal project by Dmitry Petrov (
<a href="https://dvc.org/blog/dvc-3-years-and-1-0-release" target="_blank" rel="nofollow noopener noreferrer">we just celebrated our 3rd birthday</a>).
Previously, Dmitry worked at Microsoft as a data scientist and did a PhD in
Computer Science. In 2018, Dmitry teamed up with his co-founder Ivan Shcheklein
(co-founder of <a href="https://tweetedtimes.com/" target="_blank" rel="nofollow noopener noreferrer">The Tweeted Times</a> and
<a href="https://www.sedna.org/" target="_blank" rel="nofollow noopener noreferrer">Sedna</a> contributor) to incorporate Iterative.ai and
grow the project. Iterative.ai is building enterprise tools for collaboration on
ML projects. Currently, Iterative.ai's open source flagship project is Data
Version Control (DVC), an open source version control system for managing
complex workflows, datasets, and models.</p>
<p>Development is ongoing in the core DVC project as well as new ventures into
<a href="https://dvc.org/blog/reimagining-devops-video" target="_blank" rel="nofollow noopener noreferrer">MLOps and Continuous Integration & Delivery (CI/CD)</a>
for data science. The team is small-and-mighty, with developers, engineers, and
data scientists on four continents. The open source community is a huge part of
all Iterative.ai projects; currently, DVC has more than
<a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">5,000 stars on GitHub</a> and more than 100
individual contributors!</p>
<p>One of DVC’s main principles is adapting existing software engineering practices
to machine learning. For example, DVC is built around Git version control: in an
ML project using DVC, each experiment corresponds to a Git commit. When you
check out any commit, you’ll see the source code as it was when you made the
commit, as expected. But you’ll also see your datasets as they were and the
exact pipeline of commands you ran in that experiment!</p>
<h2 id="why-become-an-ambassador" style="position:relative;">Why become an ambassador?<a href="#why-become-an-ambassador" aria-label="why become an ambassador permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Like any volunteer position, the main benefit is getting to be involved in a
project you believe in. But there are some perks:</p>
<ul>
<li>Establishing a formal relationship with DVC that can go on your CV/resume.
We'll boost your content on our social channels, too.</li>
<li>Access to support from the DVC team, such as financial resources to organize
your own meetup for local data scientists and ML enthusiasts</li>
<li>Mentorship about crafting blogs and talks, if desired. DVC team members
regularly help people in the community develop their presentations and blogs
for accuracy and clarity</li>
<li>Closer relationships with the DVC team means more chances to participate in
conversations that guide our product decisions.</li>
</ul>
<p>For students and early career professionals, you can learn a lot by interacting
with us! While you can certainly write a blog post or organize a meetup without
being an ambassador, the program is a way to fast-track your learning: you'll
have the creators of DVC helping you understand it well, and helping you
discover features and best practices you might not have known about.</p>
<p>If you're already active in the open source or MLOps community, then becoming an
ambassador is a solid way to cement your relationship with DVC. We'd love to
recognize you for the amazing stuff you already do.</p>
<h2 id="how-to-become-an-ambassador" style="position:relative;">How to become an ambassador<a href="#how-to-become-an-ambassador" aria-label="how to become an ambassador permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>If you’re interested in becoming an ambassador, send us an email at
<a href="mailto:[email protected]" target="_blank" rel="nofollow noopener noreferrer">[email protected]</a> with the subject line “I want to be an
ambassador!” Please tell us:</p>
<ul>
<li>A little about yourself and your professional background</li>
<li>Any outreach work you’ve done before</li>
<li>What kind of ambassador activities you’d be most interested in participating
in</li>
</ul>
<p>The program is structured to provide a lot of flexibility, so each ambassador
can do outreach in ways that are personally motivating and enjoyable. There are
a few guidelines:</p>
<ul>
<li>We ask for at least a one-year commitment</li>
<li>We ask ambassadors to contribute at least four activities per year, about once
every three months. There's no upper limit to how much you can do!</li>
<li>For your first contribution, we ask for a blog post; this way, we can
collaborate with you to help get all the technical details right. After that,
it’s up to you!</li>
</ul>
<h2 id="some-ideas-to-get-started" style="position:relative;">Some ideas to get started<a href="#some-ideas-to-get-started" aria-label="some ideas to get started permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Our official ambassador program is just starting, but our community already has
a lot of folks making noise. Here are just a few contributions we admire; we
think they’re pretty cool inspirations for future projects.</p>
<h3 id="blogs-and-tutorials" style="position:relative;">Blogs and tutorials<a href="#blogs-and-tutorials" aria-label="blogs and tutorials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Shareable blogs are one of our most effective outreach strategies. They give
visibility to the author <em>and</em> new ways to use DVC, so it's a win-win.</p>
<ul>
<li><a href="https://blog.codecentric.de/en/2020/01/remote-training-gitlab-ci-dvc/" target="_blank" rel="nofollow noopener noreferrer">Remote training with GitLab-CI and DVC</a>,
by Mercel Mikl and Bert Besser (Bert has also organized a DVC meetup in
Berlin)</li>
<li><a href="https://towardsdatascience.com/creating-a-solid-data-science-development-environment-60df14ce3a34" target="_blank" rel="nofollow noopener noreferrer">Creating a solid Data Science development environment</a>,
by Gabriel dos Santos Goncalves</li>
<li><a href="https://martinfowler.com/articles/cd4ml.html" target="_blank" rel="nofollow noopener noreferrer">Continuous Delivery for Machine Learning</a>,
by Danilo Sato, Arif Wider, and Christoph Windheuser</li>
<li><a href="https://mribeirodantas.xyz/blog/index.php/2020/03/05/r-dvc-and-rmarkdown/" target="_blank" rel="nofollow noopener noreferrer">Manage your Data Science Project in R</a>
was my first blog post about using DVC in an R project!</li>
</ul>
<h3 id="talks" style="position:relative;">Talks<a href="#talks" aria-label="talks permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Community members have presented at events like PyCon, PyData, and local
meetups.</p>
<ul>
<li><a href="https://www.slideshare.net/AlessiaMarcolini/version-control-for-data-science" target="_blank" rel="nofollow noopener noreferrer">Version control for data science</a>,
by Alessia Marcolini @ PyCon DE & PyData Berlin</li>
<li><a href="https://www.youtube.com/watch?v=rUTlqpcmiQw" target="_blank" rel="nofollow noopener noreferrer">How to easily set up and version control your machine learning pipelines</a>,
by Sarah Diot-Girard & Stephanie Bracaloni @ PyData Amsterdam</li>
<li><a href="https://speakerdeck.com/kurianbenoy/ml-models-and-dataset-versioning" target="_blank" rel="nofollow noopener noreferrer">ML models and dataset versioning</a>,
by Kurian Benoy @ PyCon India</li>
</ul>
<h3 id="code-contributions" style="position:relative;">Code contributions<a href="#code-contributions" aria-label="code contributions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Our GitHub repository has lots of open discussions about potential features; it's
a goldmine for ways to pitch in. For example:</p>
<ul>
<li>
<p><a href="https://github.com/elgehelge" target="_blank" rel="nofollow noopener noreferrer">Helge Munk Jacobsen</a> took on an open issue in
our code base about supporting hyperparameter tracking with DVC and made a
pull request to add this feature.</p>
</li>
<li>
<p><a href="https://github.com/verasativa/" target="_blank" rel="nofollow noopener noreferrer">Vera Sativa</a> added directory support to the
<a href="https://dvc.org/doc/command-reference/import-url"><code>dvc import-url</code></a> command, and she was our 100th contributor, so she won her
own DeeVee the owl.</p>
</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 500px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/78b685e283d679c8ebe518ea17520f6d/39600/odd_with_deevee.png" alt="odd with deevee" title="Vera and team" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Vera
(center, flashing a peace sign) thanked us with this lovely picture of DeeVee
and her team, <a href="https://odd.co" target="_blank" rel="nofollow noopener noreferrer">Odd Industries</a>.</em></p>
<p>If any of this sounds fun to you, please be in touch over
<a href="mailto:[email protected]" target="_blank" rel="nofollow noopener noreferrer">email</a> (and you can also reach us on
<a href="https://twitter.com/dvcorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> and our
<a href="https://discordapp.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord Channel</a>). We look forward to
connecting with you!</p>https://dvc.org/blog/dvc-3-years-and-1-0-releasehttps://dvc.org/blog/dvc-3-years-and-1-0-releaseMon, 04 May 2020 00:00:00 GMT<h2 id="3-years-anniversary" style="position:relative;">3 years anniversary!<a href="#3-years-anniversary" aria-label="3 years anniversary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Three years ago on <strong>May 4th, 2017</strong>, I published the
<a href="https://www.kdnuggets.com/2017/05/data-version-control-iterative-machine-learning.html" target="_blank" rel="nofollow noopener noreferrer">first blog post about DVC</a>, and
<a href="https://www.reddit.com/r/Python/comments/698ian/dvc_data_scientists_collaboration_and_iterative/" target="_blank" rel="nofollow noopener noreferrer">the first DVC discussion on Reddit</a> followed.
Until that point, DVC was a private project between
<a href="https://github.com/dmpetrov" target="_blank" rel="nofollow noopener noreferrer">myself</a> and <a href="https://github.com/efiop" target="_blank" rel="nofollow noopener noreferrer">Ruslan</a>.
Today, things look very different.</p>
<p>Today, DVC gets recognized at professional conferences: people spot our logo,
and sometimes even our faces, and want to chat. There's much more content about
DVC coming from bloggers than from inside our organization. We're seeing more
and more job postings that list DVC as a requirement, and we're showing up in
<a href="https://www.amazon.com/Learn-Python-Building-Science-Applications/dp/1789535360" target="_blank" rel="nofollow noopener noreferrer">data science textbooks</a>.
When we find a new place DVC is mentioned, we celebrate in our Slack - we've
come a long way!</p>
<p>The data science and ML space is fast-paced and vibrant, and we're proud that
DVC is making an impact on discussions about best practices for healthy,
sustainable ML. Every week, we chat with companies and research groups using DVC
to make their teams more productive. We're proud to be part of the growing MLOps
movement: so far, a majority of CI/CD for ML projects are implemented with DVC
under the hood.</p>
<p>I can confidently say that DVC wouldn't have been possible without a lot of help
from our community. Thank you to everyone who has supported us:</p>
<p><strong>DVC core team.</strong> The DVC team has been the force driving our project's
evolution - we've grown from 2 to 12 full-time engineers, developers, and data
scientists. Half of the team is focused purely on DVC, while the other half works
on new DVC-related projects. We often get feedback about how fast our team
answers user questions - we've been told our user support is one of DVC's
"killer features". It's all thanks to this amazing team.</p>
<p><strong>DVC contributors.</strong> As of today, the DVC code base has
<a href="https://github.com/iterative/dvc/graphs/contributors" target="_blank" rel="nofollow noopener noreferrer">126 individual contributors</a>.
Many of these folks put hours into their code contribution. We're grateful for
their tenacity and generosity.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 156.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/5d069d099019190069a5e5789008af9f/87fcf/vera-sativa.png" alt="vera sativa" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Vera - 100th DVC contributor
<a href="https://github.com/verasativa/" target="_blank" rel="nofollow noopener noreferrer">on GitHub</a>.</em></p>
<p><strong>Documentation contributors.</strong> Another
<a href="https://github.com/iterative/dvc.org/graphs/contributors" target="_blank" rel="nofollow noopener noreferrer">124 people contributed</a>
to the <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">DVC documentation</a> and
<a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">the website</a>. Every time a new person tries out DVC, they
benefit from the hard work that's gone into our docs.</p>
<p><strong>Active community members.</strong> Active DVC users help our team understand and
better anticipate their needs and identify priorities for development. They
share bright ideas for new features, locate and investigate bugs in code, and
welcome and support new users.</p>
<p><strong>People who give DVC a shot.</strong> Today, there are thousands of data scientists,
ML engineers, and developers using DVC on a regular basis. The number of users
is growing every week. Our <a href="http://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a> has almost two
thousand users. Hundreds more connect with us through email and Twitter. To
everyone willing to try out DVC, thank you for the opportunity.</p>
<h2 id="dvc-10-is-the-result-of-3-years-of-learning" style="position:relative;">DVC 1.0 is the result of 3 years of learning<a href="#dvc-10-is-the-result-of-3-years-of-learning" aria-label="dvc 10 is the result of 3 years of learning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>All these contributions, big and small, have a collective impact on DVC's
development. I'm happy (and a bit nervous) to announce that a pre-release of a
brand new DVC 1.0 is ready for public beta testing.</p>
<p>You can install the 1.0 pre-release from the master branch in our repo
(instructions <a href="https://dvc.org/doc/install/pre-release" target="_blank" rel="nofollow noopener noreferrer">here</a>) or through pip:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> <span class="token parameter variable">--upgrade</span> <span class="token parameter variable">--pre</span> dvc</span></code></pre></div>
<p>The new DVC is inspired by discussions and contributions from our community -
both fresh ideas and bug reports 😅.</p>
<p>Here are the most significant features we’re excited to be rolling out soon:</p>
<h3 id="run-cache" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/1234" target="_blank" rel="nofollow noopener noreferrer">Run cache</a><a href="#run-cache" aria-label="run cache permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><em>Learnings:</em> Forcing users to make Git commits for each ML experiment creates
too much overhead.</p>
<p>DVC 1.0 has a "long memory" of DVC command runs. This means it can identify if
a <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> has already been run and save compute time by returning the cached
result - <em>even if you didn't Git commit that past run</em>.</p>
<p>We added the run-cache with CI/CD systems and other MLOps and DataOps automation
tools in mind. Auto-commits are no longer needed after <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> on the CI/CD
side.</p>
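<p>To make the idea concrete, here is a minimal conceptual sketch. It is not DVC's actual implementation: it simply assumes a run can be identified by a hash of its command plus the content hashes of its dependencies, so an identical run can be served from a cache without any Git commit. The names <code>run_key</code>, <code>RunCache</code>, and <code>run_stage</code>, as well as the sample hash values, are hypothetical.</p>

```python
import hashlib
import json


def run_key(cmd, dep_hashes):
    """Build a deterministic key from the command and its dependency hashes."""
    payload = json.dumps({"cmd": cmd, "deps": dep_hashes}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RunCache:
    """Hypothetical in-memory run cache: maps a run key to recorded outputs."""

    def __init__(self):
        self._runs = {}
        self.executions = 0  # how many times a command was really executed

    def run_stage(self, cmd, dep_hashes, execute):
        key = run_key(cmd, dep_hashes)
        if key in self._runs:
            # Same command and same inputs: reuse the recorded outputs.
            return self._runs[key]
        outputs = execute()
        self.executions += 1
        self._runs[key] = outputs
        return outputs


cache = RunCache()
deps = {"users.csv": "a1b2c3"}  # made-up content hash
first = cache.run_stage("python train.py", deps, lambda: {"model.pkl": "f00d"})
# The second call has identical inputs, so the callable is never invoked.
second = cache.run_stage("python train.py", deps, lambda: {"model.pkl": "beef"})
```

<p>In this sketch, changing either the command or any dependency hash produces a different key, which forces a real run.</p>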
<h3 id="multi-stage-dvc-files" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/1871" target="_blank" rel="nofollow noopener noreferrer">Multi-stage DVC files</a><a href="#multi-stage-dvc-files" aria-label="multi stage dvc files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><em>Learnings:</em> ML pipelines evolve much faster than data engineering pipelines.</p>
<p>We redesigned the way DVC records data processing stages with metafiles, to make
pipelines more interpretable and editable. All pipeline stages are now saved
together in a single metafile instead of in separate files.</p>
<p>Data hash values are no longer stored in the pipeline metafile. This improves
human-readability.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">process</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> ./process_raw_data raw_data.log users.csv
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> raw_data.log
    <span class="token key atrule">params</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> process_file
      <span class="token punctuation">-</span> click_threshold
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> users.csv
  <span class="token key atrule">train</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> users.csv
    <span class="token key atrule">params</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> epochs
      <span class="token punctuation">-</span> log_file
      <span class="token punctuation">-</span> dropout
    <span class="token key atrule">metrics_no_cache</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> summary.json
    <span class="token key atrule">metrics</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> logs.csv
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> model.pkl</code></pre></div>
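<p>Because every stage lists its <code>deps</code> and <code>outs</code> in one place, the execution order can be derived from the file alone. Here is a rough sketch of that idea, assuming the two stages above have already been loaded into a plain Python dict (only <code>deps</code> and <code>outs</code> are kept). The <code>stage_order</code> helper is hypothetical and is not a DVC API.</p>

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The two stages from the example above, reduced to deps and outs.
stages = {
    "process": {"deps": ["raw_data.log"], "outs": ["users.csv"]},
    "train": {"deps": ["users.csv"], "outs": ["model.pkl"]},
}


def stage_order(stages):
    """Order stages so each one runs after the stages that produce its deps."""
    # Map every output file to the stage that produces it.
    producer = {
        out: name for name, spec in stages.items() for out in spec.get("outs", [])
    }
    # For each stage, collect the stages it depends on (its predecessors).
    graph = {
        name: {producer[dep] for dep in spec.get("deps", []) if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = stage_order(stages)
```

<p>Here <code>train</code> consumes <code>users.csv</code>, which <code>process</code> produces, so <code>process</code> comes first in <code>order</code>.</p>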
<h3 id="plots" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/3409" target="_blank" rel="nofollow noopener noreferrer">Plots</a><a href="#plots" aria-label="plots permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><em>Learnings:</em> Versioning metrics and plots is no less important than data
versioning.</p>
<p>Countless users asked us when we'd support metrics visualizations. Now it's
here: DVC 1.0 introduces metrics file visualization commands, <a href="https://dvc.org/doc/command-reference/metrics/diff"><code>dvc metrics diff</code></a>
and <a href="https://dvc.org/doc/command-reference/plots/show"><code>dvc plots show</code></a>. DVC plots are powered by the
<a href="https://vega.github.io/vega-lite/" target="_blank" rel="nofollow noopener noreferrer">Vega-Lite</a> graphic library.</p>
<p>This function not only shows visualizations based on the current state of your
project; it can also combine multiple plots from your
Git history in a single chart so you can compare results across commits. Users
can visualize how, for example, their model accuracy in the latest commit
differs from another commit (or even multiple commits).</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> <span class="token parameter variable">-d</span> logs.csv HEAD HEAD^ d1e4d848 baseline_march
</span>file:///Users/dmitry/src/plot/logs.csv.html
<span class="token line"><span class="token input">$ </span><span class="token command">open</span> logs.csv.html</span></code></pre></div>
<p><img src="https://dvc.org/2020-05-04/dvc-plots-092248e6898ab510fc3803efb5e22d9f.svg" alt=""></p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> <span class="token parameter variable">-d</span> logs.csv HEAD HEAD^ d1e4d848 baseline_march <span class="token punctuation">\</span>
<span class="token parameter variable">-x</span> loss <span class="token parameter variable">--template</span> scatter
</span>file:///Users/dmitry/src/plot/logs.csv.html
<span class="token line"><span class="token input">$ </span><span class="token command">open</span> logs.csv.html</span></code></pre></div>
<p><img src="https://dvc.org/2020-05-04/dvc-plots-scatter-9cfc6c2078273faa482129d8d1609967.svg" alt=""></p>
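<p>The idea of combining several revisions into one chart can be sketched as follows. This is an illustration only, not how DVC builds its plots: it assumes the metrics log of each revision is available as CSV text and merges them into one long table tagged with a <code>rev</code> column, which is the shape a multi-series chart needs. The <code>combine_revisions</code> helper and the sample numbers are made up.</p>

```python
import csv
import io


def combine_revisions(logs_by_rev):
    """Merge per-revision CSV text into one list of rows tagged with `rev`."""
    combined = []
    for rev, text in logs_by_rev.items():
        for step, row in enumerate(csv.DictReader(io.StringIO(text))):
            # Keep the revision and the row index so series can be told apart.
            combined.append({"rev": rev, "step": step, **row})
    return combined


# Made-up sample logs for two revisions.
logs_by_rev = {
    "HEAD": "loss,accuracy\n0.9,0.61\n0.5,0.80\n",
    "HEAD^": "loss,accuracy\n1.1,0.55\n0.7,0.72\n",
}
rows = combine_revisions(logs_by_rev)
```

<p>Each entry in <code>rows</code> can then be drawn as one point, colored by its <code>rev</code> value.</p>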
<h3 id="data-transfer-optimizations" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/3488" target="_blank" rel="nofollow noopener noreferrer">Data transfer optimizations</a><a href="#data-transfer-optimizations" aria-label="data transfer optimizations permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><em>Learnings:</em> In ML projects, data transfer optimization is still king.</p>
<p>We've done substantial work on optimizing data management commands, such as
<a href="https://dvc.org/doc/command-reference/pull#-c"><code>dvc pull / push / status -c / gc -c</code></a>. Now, based on the amount of data, DVC can
choose an optimal data remote traversing strategy.</p>
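<p>As a rough illustration of what choosing a strategy "based on the amount of data" could look like, the sketch below compares two ways of finding out which local objects already exist on a remote: one existence request per object, or paging through the full remote listing. The cost model, the <code>choose_traversal</code> name, and the <code>page_size</code> default are assumptions made for this example, not DVC's actual heuristic.</p>

```python
def choose_traversal(num_local_hashes, est_remote_objects, page_size=1000):
    """Pick the cheaper way to check which local objects exist remotely.

    "query_each": one existence request per local hash.
    "list_all":   page through the whole remote listing.
    """
    # Number of requests needed to list the remote (ceiling division).
    list_requests = -(-est_remote_objects // page_size)
    if num_local_hashes <= list_requests:
        return "query_each"
    return "list_all"


small_push = choose_traversal(10, 2_000_000)
large_push = choose_traversal(50_000, 2_000_000)
```

<p>With these assumed numbers, a handful of files is checked one by one, while a large batch is cheaper to resolve by listing the remote once.</p>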
<p><a href="https://github.com/iterative/dvc/issues/2147" target="_blank" rel="nofollow noopener noreferrer">Mini-indexes</a> were introduced to
help DVC instantly check data directories instead of iterating over millions of
files. This also speeds up adding files to and removing files from large directories.</p>
<p>More optimizations are included in the release based on performance bottlenecks
we profiled. A more detailed
<a href="https://gist.github.com/pmrowla/338d9645bd05df966f8aba8366cab308" target="_blank" rel="nofollow noopener noreferrer">benchmark report</a>
shows how many seconds it takes to run specific commands on a directory of 2M
images.</p>
<p><img src="https://dvc.org/2020-05-04/benchmarks-fb3909a1a199bbfdfb5b66b689e2ffb0.svg" alt=""></p>
<h3 id="hyperparameter-tracking" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/3393" target="_blank" rel="nofollow noopener noreferrer">Hyperparameter tracking</a><a href="#hyperparameter-tracking" aria-label="hyperparameter tracking permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><em>Learnings:</em> ML pipeline steps depend only on a subset of the config file.</p>
<p>This feature was actually released in the previous version, DVC 0.93 (see
<a href="https://dvc.org/doc/command-reference/params" target="_blank" rel="nofollow noopener noreferrer">params docs</a>). However, it is an
important step to support configuration files and ML experiments in a more
holistic way.</p>
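<p>The learning above can be illustrated with a small sketch: if each stage declares only the parameters it uses, a change elsewhere in the config file does not invalidate it. The parameter names come from the pipeline example earlier in this post; the <code>changed_params</code> helper and the values are hypothetical, and this is not DVC's implementation.</p>

```python
def changed_params(tracked, old_config, new_config):
    """Return the tracked parameter names whose values differ."""
    return [k for k in tracked if old_config.get(k) != new_config.get(k)]


# Two versions of a config file; only click_threshold was edited.
old = {"epochs": 10, "dropout": 0.3, "click_threshold": 5}
new = {"epochs": 10, "dropout": 0.3, "click_threshold": 8}

# Each stage looks only at the parameters it declared.
train_changed = changed_params(["epochs", "log_file", "dropout"], old, new)
process_changed = changed_params(["process_file", "click_threshold"], old, new)
```

<p>In this example only the <code>process</code> stage sees a change, so <code>train</code> would not need to rerun because of its parameters alone.</p>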
<h3 id="for-more-information-on-the-new-features" style="position:relative;">For more information on the new features…<a href="#for-more-information-on-the-new-features" aria-label="for more information on the new features permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Each of the big new features and improvements deserves a separate blog post. We
will be posting more - please stay in touch.</p>
<p>I hope our most active users will find time to check the DVC pre-release
version and provide their feedback. The installation instructions are
<a href="https://dvc.org/doc/install/pre-release" target="_blank" rel="nofollow noopener noreferrer">on our website</a>.</p>
<h2 id="5000-github-stars" style="position:relative;">5000 GitHub stars<a href="#5000-github-stars" aria-label="5000 github stars permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Activity on our GitHub page has grown organically since the DVC repo went public
on May 4th, 2017. Coincidentally, today, on our 3rd anniversary, we have
reached 5000 stars:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bf64a01a292055b72fcf916ef2d6d1f8/39600/5k_github.png" alt="5k github" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h2 id="thank-you" style="position:relative;">Thank you!<a href="#thank-you" aria-label="thank you permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Thank you again to all DVC contributors, community members, and users. Every
piece of your help is highly appreciated and will bring huge benefits to the
entire ecosystem of data and ML projects.</p>
<p>Stay healthy and safe, wherever you are in the world. And be in touch on
<a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> and our
<a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>.</p>https://dvc.org/blog/gsod-ideas-2020https://dvc.org/blog/gsod-ideas-2020Thu, 30 Apr 2020 00:00:00 GMT<p>After a successful experience with the first edition of <strong>Google Season of
Docs</strong> <a href="https://dvc.org/blog/dvc-project-ideas-for-google-summer-of-docs-2019">in 2019</a>, we're
putting out a call for writers to apply to work with DVC as part of the
<a href="https://developers.google.com/season-of-docs" target="_blank" rel="nofollow noopener noreferrer">2020 edition</a>. If you want to
write open source software documentation with mentorship from our team, read on.</p>
<p><strong>TLDR</strong>: Skip to <a href="#project-ideas">project ideas</a>.</p>
<p><a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a> has a dedicated docs team and a
<a href="https://dvc.org/doc/user-guide/contributing/docs" target="_blank" rel="nofollow noopener noreferrer">well-defined process</a> for
creating and maintaining our documentation, modeled in part based on our past
GSoD experience. We are happy to share our experience, introduce technical
writers to the world of open source and machine learning best practices, and
work together on improving our documentation.</p>
<h2 id="previous-experience" style="position:relative;">Previous experience<a href="#previous-experience" aria-label="previous experience permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In last year's Season, we matched with prolific writer
<a href="https://github.com/dashohoxha" target="_blank" rel="nofollow noopener noreferrer">Dashamir</a>, who helped us give proper structure
to important parts of our docs and address key issues.</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">As <a href="https://twitter.com/hashtag/GSoD2019?src=hash&ref_src=twsrc%5Etfw">#GSoD2019</a> is officially over we would like to thank <a href="https://twitter.com/dashohoxha">@dashohoxha</a> for contributing online interactive tutorials <a href="https://t.co/iZKkqmx5pm">https://t.co/iZKkqmx5pm</a> (👈 link or search for Katacoda button on <a href="https://t.co/b8MwcZdY3s">https://t.co/b8MwcZdY3s</a>) 😍 Thank you <a href="https://twitter.com/GoogleOSS">@GoogleOSS</a> team and <a href="https://twitter.com/chenopis">@chenopis</a> for organizing this 🙏! <a href="https://t.co/SGrgtA5J0B">pic.twitter.com/SGrgtA5J0B</a></p>— 🦉DVC (@DVCorg) <a href="https://twitter.com/DVCorg/status/1205203662827483136">December 12, 2019</a></blockquote>
<p>Some of our achievements together were:</p>
<ul>
<li>Reorganized our <a href="https://github.com/iterative/dvc.org/pull/666" target="_blank" rel="nofollow noopener noreferrer">tutorials</a> and
core <a href="https://github.com/iterative/dvc.org/pull/726" target="_blank" rel="nofollow noopener noreferrer">contribution guide</a></li>
<li>Created <a href="https://github.com/iterative/dvc.org/issues/546" target="_blank" rel="nofollow noopener noreferrer">interactive lessons</a>
on <a href="https://www.katacoda.com/dvc" target="_blank" rel="nofollow noopener noreferrer">Katacoda</a></li>
<li>Docs <a href="https://github.com/iterative/dvc.org/pull/734" target="_blank" rel="nofollow noopener noreferrer">cleanup</a></li>
<li>Suggested the creation of a
<a href="https://github.com/iterative/dvc.org/issues/563" target="_blank" rel="nofollow noopener noreferrer">How To</a> section for our docs</li>
<li>Other
<a href="https://github.com/iterative/dvc.org/pulls?q=is%3Apr+is%3Aclosed+author%3Adashohoxha" target="_blank" rel="nofollow noopener noreferrer">contributions</a></li>
</ul>
<p>Another collaborator we connected with via GSoD’19 was an amazing student
intern, <a href="https://github.com/algomaster99" target="_blank" rel="nofollow noopener noreferrer">Aman</a>. He helped us address
<a href="https://github.com/iterative/dvc.org/pulls?q=is%3Apr+author%3Aalgomaster99+is%3Aclosed" target="_blank" rel="nofollow noopener noreferrer">dozens of tickets</a>
related to our Node.js docs web app. For example:</p>
<ul>
<li>
<p>Contributed to our
<a href="https://github.com/iterative/dvc.org/pull/315" target="_blank" rel="nofollow noopener noreferrer">command reference</a> and
<a href="https://github.com/iterative/dvc.org/pull/366" target="_blank" rel="nofollow noopener noreferrer">user guide</a>, and created a
much needed
<a href="https://github.com/iterative/dvc.org/pull/317" target="_blank" rel="nofollow noopener noreferrer">documentation contribution</a>
guide</p>
</li>
<li>
<p><a href="https://github.com/iterative/dvc.org/pull/328" target="_blank" rel="nofollow noopener noreferrer">Formatted</a> the source code of
our docs and established an
<a href="https://github.com/iterative/dvc.org/pull/386" target="_blank" rel="nofollow noopener noreferrer">automated mechanism</a> to
enforce pretty formatting going forward</p>
</li>
<li>
<p>Implemented super useful hovering tooltips based on a special
<a href="https://github.com/iterative/dvc.org/pull/431" target="_blank" rel="nofollow noopener noreferrer">glossary</a>:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 595px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/66e1324a2a8352f1b3605e3fa6b90731/39600/tooltip.png" alt="tooltip" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Tooltip in the <a href="https://dvc.org/doc/command-reference/remote"><code>dvc remote</code></a>
command reference</em></p>
</li>
</ul>
<h3 id="community-outreach" style="position:relative;">Community outreach<a href="#community-outreach" aria-label="community outreach permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>More positive results of the program included talks and meetups organized by our
open source contributors, with our mentorship:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 604.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/263792dadd1fa5d01ed810e1d7a09bb8/39600/SciPy_India_Aman.png" alt="SciPy India Aman" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Our intern Aman took a
several-hour train ride to give a
<a href="https://static.fossee.in/scipy2019/SciPyTalks/SciPyIndia2019%5FS011%5FStoring%5Fa%5Ffew%5Fversions%5Fof%5Fa%5F5GB%5Ffile%5Fin%5Fa%5Fdata%5Fscience%5Fproject%5F20191130.mp4" target="_blank" rel="nofollow noopener noreferrer">talk</a>
at <a href="https://scipy.in/2019" target="_blank" rel="nofollow noopener noreferrer">SciPy India 2019</a>.</em></p>
<p>Another star contributor who found our project via GSoD,
<a href="https://github.com/kurianbenoy" target="_blank" rel="nofollow noopener noreferrer">Kurian</a>, closed
<a href="https://github.com/iterative/dvc.org/issues?q=is%3Aissue+kurianbenoy" target="_blank" rel="nofollow noopener noreferrer">several tickets</a>,
produced a DVC intro tutorial in
<a href="https://www.kaggle.com/kurianbenoy/introduction-to-data-version-control-dvc" target="_blank" rel="nofollow noopener noreferrer">Kaggle</a>
and
<a href="https://colab.research.google.com/drive/1O1XmUZ8Roj1dFxWTrpE55_A7lVkWfG04" target="_blank" rel="nofollow noopener noreferrer">Colab</a>,
and ended up giving a talk at
<a href="https://in.pycon.org/cfp/2019/proposals/machine-learning-model-and-dataset-versioning~dRqRb/" target="_blank" rel="nofollow noopener noreferrer">PyCon India</a>:</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/Ipzf6oQqQpo?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>He also covered DVC for the
<a href="https://kurianbenoy.github.io/2019-11-03-Devsprints%5Fexperience/" target="_blank" rel="nofollow noopener noreferrer">Devsprints</a>
of <a href="https://enotice.vtools.ieee.org/public/50448" target="_blank" rel="nofollow noopener noreferrer">MEC.conf</a>:</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Thank you <a href="https://twitter.com/DVCorg">@DVCorg</a> for participating in the Devsprints, by <a href="https://twitter.com/FossMec">@FossMEC</a> and <a href="https://twitter.com/excelmec">@excelmec</a>. We had <a href="https://twitter.com/shcheklein">@shcheklein</a> who joined us all the way from SF and explained how open source is boosting the future. Srinidhi and <a href="https://twitter.com/kurianbenoy2">@kurianbenoy2</a> helped participants get started to contributing to the project.</p>— FOSS MEC (@FossMec) <a href="https://twitter.com/FossMec/status/1192866498324254720">November 8, 2019</a></blockquote>
<p>Yet another outstanding contributor,
<a href="https://twitter.com/explorer_07" target="_blank" rel="nofollow noopener noreferrer">Nabanita</a>, ended up organizing a DVC-themed
hackathon later that year:</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Our open source event Hacktoberfest-themed meet-up was a success. Thanks to <a href="https://twitter.com/DVCorg">@DVCorg</a> and it's mentors for all the hard work. <br>Some of our attendees made their first PR on DVC and got them merged. Kudos to the team! <br>PS: 🍕 was the second best thing of the evening. <a href="https://t.co/zAWC0TVlPd">pic.twitter.com/zAWC0TVlPd</a></p>— Programming Society IIIT-Bh (@psociiit) <a href="https://twitter.com/psociiit/status/1185150096792535040">October 18, 2019</a></blockquote>
<h2 id="prerequisites-to-apply" style="position:relative;">Prerequisites to apply<a href="#prerequisites-to-apply" aria-label="prerequisites to apply permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Besides the general requirements to apply to Google Season of Docs, there are a
few skills we look for in applicants.</p>
<ol>
<li>
<p><strong>Clear English writing.</strong> We strive to express the concepts, processes, and
details around DVC clearly, correctly, and completely. We use general and
friendly wording as much as possible and pay close attention to consistency
in our terminology. Our team will help with copy editing.</p>
</li>
<li>
<p><strong>Command line experience.</strong> <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">DVC</a> is a command line
tool that builds on top of <a href="https://git-scm.com/" target="_blank" rel="nofollow noopener noreferrer">Git</a>, so being able to play
with it and test the features will be very useful. Creating and managing
files, GNU/Linux commands, file and permission administration are desired
skills.</p>
</li>
<li>
<p><strong>People skills.</strong> We put a high value on communication: the ability to
discuss ideas, explain your goals, report progress, and work kindly with more
or less technical teammates.</p>
</li>
</ol>
<p>If you like our mission but aren't sure if you're sufficiently prepared, please
be in touch anyway. We'd love to hear from you.</p>
<h2 id="project-ideas" style="position:relative;">Project ideas<a href="#project-ideas" aria-label="project ideas permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Below are several project ideas that are an immediate priority for the DVC docs
team. We welcome technical writers to create their own proposals, even if they
differ from our ideas. Most projects will be mentored primarily by our lead
technical writer, <a href="https://github.com/jorgeorpinel" target="_blank" rel="nofollow noopener noreferrer">Jorge</a>.</p>
<ol>
<li>
<p><strong>"How To" section.</strong> Other than our
<a href="https://dvc.org/doc/use-cases" target="_blank" rel="nofollow noopener noreferrer">use cases</a>, we still lack a good place to
answer common questions in our docs (think FAQ). We have compiled a
<a href="https://github.com/iterative/dvc.org/issues/899" target="_blank" rel="nofollow noopener noreferrer">set of topics</a> that we
think would be best explained in a new <strong>How To</strong> section for this purpose.</p>
<p>This project would imply relocating bits and pieces of info from existing
docs into new how-tos, as well as writing significant new material to
complete them. Expanding on our
<a href="https://dvc.org/doc/user-guide/troubleshooting" target="_blank" rel="nofollow noopener noreferrer">troubleshooting</a> page would
probably go well as part of this project as well.</p>
<p><em>Difficulty rating:</em> Beginner-Medium<br><br></p>
</li>
<li>
<p><strong>DVC 1.0 docs.</strong> We are soon to release DVC 1.0.0! This version brings some
significant changes that for the first time in our
<a href="https://github.com/iterative/dvc/releases" target="_blank" rel="nofollow noopener noreferrer">release history</a> are not
completely backward-compatible. We expect that fully updating all our
previous docs will take a few months, and you could help us with this! The
main new features are listed below.</p>
<blockquote>
<p>UPDATE: See <a href="https://dvc.org/blog/dvc-3-years-and-1-0-release" target="_blank" rel="nofollow noopener noreferrer">post</a> about
the release! And corresponding docs
<a href="https://github.com/iterative/dvc.org/issues/1255" target="_blank" rel="nofollow noopener noreferrer">epic</a> task</p>
</blockquote>
<ul>
<li>A
<a href="https://github.com/iterative/dvc/issues/1871" target="_blank" rel="nofollow noopener noreferrer">multi-stage <em>pipelines file</em></a>
that partially substitutes
<a href="https://dvc.org/doc/user-guide/dvc-files" target="_blank" rel="nofollow noopener noreferrer">DVC files</a></li>
<li>Separation between
<a href="https://github.com/iterative/dvc/issues/3409" target="_blank" rel="nofollow noopener noreferrer">scalar vs. continuous metrics</a>,
and new commands to visualize them, such as <a href="https://dvc.org/doc/command-reference/plots"><code>dvc plots</code></a></li>
<li>A new <a href="https://github.com/iterative/dvc/issues/1234" target="_blank" rel="nofollow noopener noreferrer">run cache</a> that
automatically saves experiment checkpoints between commits</li>
</ul>
<p><em>Difficulty rating:</em> Beginner-Medium<br><br></p>
</li>
<li>
<p><strong>Video tutorials.</strong> Written documentation is great, but other media can also
be important for our organization to reach a wide variety of learners.
Expanding to video is also a core part of our developer advocacy strategy.</p>
<p>One of DVC's priorities for this year is creating a library of video
tutorials ranging from short explanations of basic DVC functions to more
advanced use cases. You could assist in writing the scripts or even take the
lead producing some videos, so image/video editing skills would come in handy
(optional).</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/70a290d5570858cf3528fbe72f6070a9/39600/Discord_user_video_tutorials.png" alt="Discord user video tutorials" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Video
tutorials are a common request by users in our <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">chat</a>.</em></p>
<p><strong>Mentor</strong>: <a href="https://github.com/elleobrien" target="_blank" rel="nofollow noopener noreferrer">Elle</a></p>
<p><em>Difficulty rating:</em> Beginner-Medium<br><br></p>
</li>
<li>
<p><strong>Interactive guides.</strong> Many of our docs include command line examples to
illustrate how DVC works. In some cases these are full guides we want people
to be able to follow by copying commands into their terminals. This has a few
challenges: mainly keeping the rest of the document maintainable, brief, and
easy to read; and supporting people on all platforms (Mac, Windows, Linux).</p>
<p>So we started extracting some of the command examples into interactive
<a href="https://www.katacoda.com/dvc" target="_blank" rel="nofollow noopener noreferrer">Katacoda scenarios</a> to match certain docs.
However, they are in need of maintenance and completion, as well as being
embedded into the corresponding pages per
<a href="https://github.com/iterative/dvc.org/issues/670" target="_blank" rel="nofollow noopener noreferrer">this issue</a>.</p>
<p>This may involve working with our front-end team or, preferably, having some
JavaScript coding experience.</p>
<p><em>Difficulty rating:</em> Medium-Advanced</p>
</li>
<li>
<p><strong>JavaScript engine UI/UX.</strong> Our website has custom
<a href="https://github.com/iterative/dvc.org/tree/main/src" target="_blank" rel="nofollow noopener noreferrer">source code</a> we've
developed over the years to host our landing pages, docs, and blog all in a
high-performance, advanced static site (Node.js, Gatsby, React, TypeScript).
We have several goals to further improve the usability and structure of our
site, such as:</p>
<ul>
<li>Creating a
<a href="https://github.com/iterative/dvc.org/issues/1073" target="_blank" rel="nofollow noopener noreferrer">special docs home page</a></li>
<li>Improving <a href="https://github.com/iterative/dvc.org/issues/808" target="_blank" rel="nofollow noopener noreferrer">mobile menus</a></li>
<li>Better navigation sidebar
<a href="https://github.com/iterative/dvc.org/issues/753" target="_blank" rel="nofollow noopener noreferrer">highlighting</a> and
<a href="https://github.com/iterative/dvc.org/issues/1198" target="_blank" rel="nofollow noopener noreferrer">positioning</a></li>
<li>Other
<a href="https://github.com/iterative/dvc.org/issues?q=is%3Aopen+is%3Aissue+label%3Adoc-engine" target="_blank" rel="nofollow noopener noreferrer">doc-engine</a>
and
<a href="https://github.com/iterative/dvc.org/issues?q=is%3Aopen+is%3Aissue+label%3Ablog-engine" target="_blank" rel="nofollow noopener noreferrer">blog-engine</a>
issues</li>
</ul>
<p><em>Difficulty rating:</em> Medium-Advanced<br><br></p>
</li>
<li>
<p><strong>SEO/Site Analytics.</strong> Our current website analytics are somewhat basic. We
will need to have a clear strategy to follow and improve our Search Engine
results (with meta content, media optimization,
<a href="https://github.com/iterative/dvc.org/issues?q=is%3Aissue+is%3Aopen+seo" target="_blank" rel="nofollow noopener noreferrer">etc.</a>),
as well as to understand the behavior of our users to improve their
experience. The specifics of the project are left for the applicant to
suggest! This should be relatively simple for someone with proven experience
in SEO or website QA.</p>
<p>What tools should we employ? (e.g. Google Analytics, etc.) What trends and
reports do we need to focus on? What kinds of users do we have and what
interaction flows do they each follow? Can we semi-identify these users
and/or cross-examine their data with DVC
<a href="https://dvc.org/doc/user-guide/analytics" target="_blank" rel="nofollow noopener noreferrer">usage analytics</a>? Let's come up
with a plan to answer these and other related questions!</p>
<p><em>Difficulty rating:</em> Beginner-Medium<br><br></p>
</li>
</ol>
<blockquote>
<p>For more inspiration, feel free to review our
<a href="https://github.com/iterative/dvc.org/labels/epic" target="_blank" rel="nofollow noopener noreferrer">epics</a> and other open docs
<a href="https://github.com/iterative/dvc.org/issues?q=is%3Aopen+is%3Aissue+label%3Adoc-content+" target="_blank" rel="nofollow noopener noreferrer">issues</a>.</p>
</blockquote>
<h2 id="if-youd-like-to-apply" style="position:relative;">If you'd like to apply<a href="#if-youd-like-to-apply" aria-label="if youd like to apply permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Please refer to the
<a href="https://developers.google.com/season-of-docs" target="_blank" rel="nofollow noopener noreferrer">Google Season of Docs</a>
application guides for specifics of the program. Writers looking to know more
about DVC, and our worldwide community of contributors, will learn most by
visiting our <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord chat</a>,
<a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">GitHub repository</a>, and
<a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">Forum</a>. We are available to discuss project proposals
from interested writers and can be reached by <a href="mailto:[email protected]" target="_blank" rel="nofollow noopener noreferrer">email</a> or
on Discord.</p>https://dvc.org/blog/april-20-community-gemshttps://dvc.org/blog/april-20-community-gemsThu, 16 Apr 2020 00:00:00 GMT<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Here are some Q&A's from our Discord channel that we think are worth sharing.</p>
<h3 id="q-how-can-i-view-and-download-files-that-are-being-tracked-by-dvc-in-a-repository" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/698815826870009868" target="_blank" rel="nofollow noopener noreferrer">How can I view and download files that are being tracked by DVC in a repository?</a><a href="#q-how-can-i-view-and-download-files-that-are-being-tracked-by-dvc-in-a-repository" aria-label="q how can i view and download files that are being tracked by dvc in a repository permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>To list the files that are currently being tracked in a project repository by
DVC and Git, you can use <a href="https://dvc.org/doc/command-reference/list"><code>dvc list</code></a>. This will display the contents of that
repository, including <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files. To download the contents corresponding to a
particular <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file, use <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a>.</p>
<p>Let's consider an example using both functions. Assume we're working with DVC's
data registry example repository. To list the files present, run:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc list</span> <span class="token parameter variable">-R</span> https://github.com/iterative/dataset-registry
</span>.gitignore
README.md
get-started/.gitignore
get-started/data.xml
get-started/data.xml.dvc
...</code></pre></div>
<p>Note the <code>-R</code> flag, which enables <a href="https://dvc.org/doc/command-reference/list"><code>dvc list</code></a> to display the contents of
directories inside the repository. Now assume you want to download <code>data.xml</code>,
which we can see is being tracked by DVC. To download the dataset to your local
workspace, you would then run:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> https://github.com/iterative/dataset-registry get-started/data.xml</span></code></pre></div>
<p>For more examples and information,
<a href="https://dvc.org/doc/command-reference/list#list" target="_blank" rel="nofollow noopener noreferrer">see the documents</a> for
<a href="https://dvc.org/doc/command-reference/list"><code>dvc list</code></a> and for <a href="https://dvc.org/doc/command-reference/get" target="_blank" rel="nofollow noopener noreferrer"><code>dvc get</code></a>.</p>
<h3 id="q-im-setting-up-cloud-remote-storage-for-dvc-and-id-like-to-forbid-dvc-gc---cloud-so-users-cant-accidently-delete-files-in-the-remote-will-it-be-sufficient-to-restrict-deletion-in-the-remotes-settings" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/698116671298076672" target="_blank" rel="nofollow noopener noreferrer">I'm setting up cloud remote storage for DVC and I'd like to forbid <code>dvc gc --cloud</code> so users can't accidently delete files in the remote. Will it be sufficient to restrict deletion in the remote's settings?</a><a href="#q-im-setting-up-cloud-remote-storage-for-dvc-and-id-like-to-forbid-dvc-gc---cloud-so-users-cant-accidently-delete-files-in-the-remote-will-it-be-sufficient-to-restrict-deletion-in-the-remotes-settings" aria-label="q im setting up cloud remote storage for dvc and id like to forbid dvc gc cloud so users cant accidently delete files in the remote will it be sufficient to restrict deletion in the remotes settings permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You're right to be careful, because <a href="https://dvc.org/doc/command-reference/gc#--cloud"><code>dvc gc --cloud</code></a> can be dangerous in the
wrong hands: it'll remove any unused files in your remote (for more info,
<a href="https://dvc.org/doc/command-reference/gc#gc" target="_blank" rel="nofollow noopener noreferrer">see our docs</a>). To prevent users
from having this power, setting your bucket policy to block object deletions
should do the trick. How to do this will depend on your cloud storage provider;
we found some relevant docs for
<a href="https://cloud.google.com/iam/docs/understanding-roles#cloud_storage_roles" target="_blank" rel="nofollow noopener noreferrer">GCP</a>,
<a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/using-with-s3-actions.html" target="_blank" rel="nofollow noopener noreferrer">S3</a>,
and
<a href="https://docs.microsoft.com/en-us/azure/storage/common/storage-auth-aad" target="_blank" rel="nofollow noopener noreferrer">Azure</a>.
For the full list of supported remote storage types,
<a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">see here</a>.</p>
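<p>As a rough illustration for S3 only, the sketch below builds a bucket policy that denies object deletion. The bucket name <code>my-dvc-remote</code> and the helper <code>deny_delete_policy</code> are hypothetical, and a blanket deny for every principal is an assumption made for brevity: it would also block administrators, so a real policy would likely scope the principal more narrowly.</p>

```python
import json


def deny_delete_policy(bucket: str) -> dict:
    """Build an S3 bucket policy that denies object deletion (illustrative helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyObjectDeletion",
                "Effect": "Deny",
                # Assumption: deny for everyone; narrow this in practice.
                "Principal": "*",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                # Matches every object in the bucket, not the bucket itself.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


if __name__ == "__main__":
    # "my-dvc-remote" is a placeholder bucket name.
    print(json.dumps(deny_delete_policy("my-dvc-remote"), indent=2))
```

<p>The printed JSON could then be attached to the bucket through your provider's console or tooling. With deletions denied, <code>dvc gc --cloud</code> should be unable to remove objects, while uploads and downloads are unaffected by this statement.</p>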
<h3 id="q-my-team-is-interested-in-dvc-and-we-have-all-of-our-data-in-remote-storage-do-we-need-to-install-a-centralised-enterprise-version-of-dvc-on-a-dedicated-server-and-do-we-have-to-also-have-a-github-repository" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/692524884701478992" target="_blank" rel="nofollow noopener noreferrer">My team is interested in DVC, and we have all of our data in remote storage. Do we need to install a centralised enterprise version of DVC on a dedicated server? And do we have to also have a GitHub repository?</a><a href="#q-my-team-is-interested-in-dvc-and-we-have-all-of-our-data-in-remote-storage-do-we-need-to-install-a-centralised-enterprise-version-of-dvc-on-a-dedicated-server-and-do-we-have-to-also-have-a-github-repository" aria-label="q my team is interested in dvc and we have all of our data in remote storage do we need to install a centralised enterprise version of dvc on a dedicated server and do we have to also have a github repository permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There's no need for a DVC server. Our remote storage works on top of
<a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">most kinds of cloud storage by default</a>,
including S3, GCP, Azure, Google Drive, and Aliyun, with no additional
infrastructure required. As for GitHub (or BitBucket, or GitLab, etc.), this is
only needed if you're interested in sharing your project with others over that
channel. We <em>like</em> sharing projects on GitHub, but you don't have to. Any Git
repository, even a local one, will do.</p>
<p>So a "minimal" DVC project for you might consist of a local workspace with Git
enabled (which you <em>do</em> need), a local Git repository, and your S3 remote
storage. Check out our
<a href="https://dvc.org/doc/use-cases/versioning-data-and-model-files" target="_blank" rel="nofollow noopener noreferrer">use cases</a> to
see some examples of infrastructure and workflow for teams.</p>
<h3 id="q-could-there-be-any-issues-with-concurrent-dvc-push-es-to-the-same-remote" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/680053750320332800" target="_blank" rel="nofollow noopener noreferrer">Could there be any issues with concurrent <code>dvc push</code>-es to the same remote?</a><a href="#q-could-there-be-any-issues-with-concurrent-dvc-push-es-to-the-same-remote" aria-label="q could there be any issues with concurrent dvc push es to the same remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There are a few ways for concurrency to occur: multiple jobs running in parallel
on the same machine, or different users on different machines. But in any case,
the answer is the same: there's nothing to worry about! When pushing a file to a
DVC remote, all operations are non-destructive and atomic.</p>
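<p>One way to picture why concurrent writers need not conflict is a content-addressed store, where a file's location is derived from a hash of its contents. The sketch below is not DVC's code: the <code>push</code> and <code>demo</code> helpers, the MD5-based layout, and the write-then-rename step are assumptions chosen to illustrate the idea on a local POSIX filesystem.</p>

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def push(store: Path, data: bytes) -> Path:
    """Store data under a path derived from its hash (hypothetical helper)."""
    digest = hashlib.md5(data).hexdigest()
    target = store / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a private temporary file, then rename it into place.
    fd, tmp = tempfile.mkstemp(dir=target.parent)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    # The rename is atomic within one filesystem, so readers never
    # observe a partially written object.
    os.replace(tmp, target)
    return target


def demo() -> tuple:
    """Push the same payload from eight threads and inspect the store."""
    with tempfile.TemporaryDirectory() as root:
        store = Path(root)
        payload = b"same dataset contents"
        with ThreadPoolExecutor(max_workers=8) as pool:
            paths = list(pool.map(lambda _: push(store, payload), range(8)))
        files = [p for p in store.rglob("*") if p.is_file()]
        # Every writer resolved to the same target path.
        assert len(set(paths)) == 1
        return files[0].read_bytes(), len(files)
```

<p>Because identical contents always map to the same path, a second writer can only replace an object with the same bytes, and different contents land at different paths, so nothing is overwritten with something else.</p>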
<h3 id="q-how-do-i-only-download-part-of-my-remote-repository-for-example-i-only-need-the-final-output-of-my-pipeline-not-the-raw-data-or-intermediate-steps" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/696751934777852004" target="_blank" rel="nofollow noopener noreferrer">How do I only download part of my remote repository? For example, I only need the final output of my pipeline, not the raw data or intermediate steps.</a><a href="#q-how-do-i-only-download-part-of-my-remote-repository-for-example-i-only-need-the-final-output-of-my-pipeline-not-the-raw-data-or-intermediate-steps" aria-label="q how do i only download part of my remote repository for example i only need the final output of my pipeline not the raw data or intermediate steps permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We support granular operations on DVC project repositories! Say your project
contains several <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files corresponding to different stages of
your pipeline: <code>0_process_data.dvc</code>, <code>1_split_test_train.dvc</code>, and
<code>2_train_model.dvc</code>. If you're only interested in the files output by the final
stage of the pipeline (<code>2_train_model.dvc</code>), you can run:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> 2_train_model.dvc</span></code></pre></div>
<p>You can also use <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> at the level of individual files. This might be
needed if your DVC pipeline file creates 10 outputs, for example, and you only
want to pull one (say, <code>model.pkl</code>, your trained model) from remote DVC storage.
You'd simply run:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> model.pkl</span></code></pre></div>
<h3 id="q-how-can-i-remove-a-dvc-file-but-keep-the-associated-files-in-my-workspace" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/689827778358673469" target="_blank" rel="nofollow noopener noreferrer">How can I remove a <code>.dvc</code> file, but keep the associated files in my workspace?</a><a href="#q-how-can-i-remove-a-dvc-file-but-keep-the-associated-files-in-my-workspace" aria-label="q how can i remove a dvc file but keep the associated files in my workspace permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Sometimes, you realize you don't want to put a file under DVC tracking after
all. That's okay, and easy to fix. Simply remove the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file like any other file:
<code>rm &lt;file&gt;.dvc</code>. DVC will then stop tracking the file, and the associated target
file will still be in your local workspace. Note that the file will still be in
your
<a href="https://dvc.org/doc/user-guide/dvc-internals#structure-of-cache-directory" target="_blank" rel="nofollow noopener noreferrer">DVC cache</a>
unless you clear it with <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a>.</p>
<h3 id="q-im-trying-to-move-a-stage-file-with-dvc-move-but-im-getting-an-error-whats-going-on" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/685125650901630996" target="_blank" rel="nofollow noopener noreferrer">I'm trying to move a stage file with <code>dvc move</code>, but I'm getting an error. What's going on?</a><a href="#q-im-trying-to-move-a-stage-file-with-dvc-move-but-im-getting-an-error-whats-going-on" aria-label="q im trying to move a stage file with dvc move but im getting an error whats going on permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The <a href="https://dvc.org/doc/command-reference/move"><code>dvc move</code></a> command is used to rename a file or directory and simultaneously
modify its corresponding DVC file. It's handy so you don't rename a file in your
local workspace that's under DVC tracking without updating DVC to the change
(see an <a href="https://dvc.org/doc/command-reference/move#description" target="_blank" rel="nofollow noopener noreferrer">example here</a>).
The command doesn't work on
<a href="https://dvc.org/doc/tutorials/pipelines#define-stages" target="_blank" rel="nofollow noopener noreferrer">"stage files"</a> from DVC
pipelines. There's not currently an easy way to safely move <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> files,
and it's an
<a href="https://github.com/iterative/dvc/issues/1489" target="_blank" rel="nofollow noopener noreferrer">open issue we're working on</a>.
Until then, you can manually update <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, or make a new one in the desired
location.</p>
<h3 id="q-i-just-starting-using-dvc-and-noticed-that-when-i-dvc-push-files-to-remote-cloud-storage-the-directory-in-my-remote-looks-like-my-dvc-cache-not-my-local-workspace-directory-is-this-right" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/693740598498426930" target="_blank" rel="nofollow noopener noreferrer">I just started using DVC and noticed that when I <code>dvc push</code> files to remote cloud storage, the directory in my remote looks like my DVC cache, not my local workspace directory. Is this right?</a><a href="#q-i-just-starting-using-dvc-and-noticed-that-when-i-dvc-push-files-to-remote-cloud-storage-the-directory-in-my-remote-looks-like-my-dvc-cache-not-my-local-workspace-directory-is-this-right" aria-label="q i just starting using dvc and noticed that when i dvc push files to remote cloud storage the directory in my remote looks like my dvc cache not my local workspace directory is this right permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yep, that's exactly how it should be! In order to provide deduplication and some
other optimizations, your DVC remote's directory structure will mirror the DVC
cache (which is by default in your local workspace under <code>.dvc/cache</code>).
Effectively, DVC uses your Git repository to store DVC files, which are keys for
cache files on your remote. So looking inside your remote won't be particularly
enlightening if you're looking for human-readable filenames: the file names will
look like hashes (because, well, they are). Luckily, DVC handles all the
conversions between the filenames in your local workspace and these hashes.</p>
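<p>As a rough illustration of why the remote holds hash-named files, the sketch
below computes an MD5 digest of a file's contents and turns it into a
cache-style relative path (the first two hex characters as a directory, the rest
as the file name). The helper name <code>cache_style_path</code> and the sample
files are hypothetical, and the exact layout may differ between DVC versions, so
treat this as a sketch rather than DVC's actual implementation.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import hashlib
import tempfile
from pathlib import Path


def cache_style_path(file_path):
    """Return a relative path derived from the MD5 of the file's contents.

    Hypothetical helper for illustration; not part of DVC's API.
    """
    digest = hashlib.md5(Path(file_path).read_bytes()).hexdigest()
    # First two hex characters become a directory, the rest the file name.
    return f"{digest[:2]}/{digest[2:]}"


# Two files with different names but identical contents map to the same path,
# which is what makes deduplication possible.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "model.pkl"
    second = Path(tmp) / "copy_of_model.pkl"
    first.write_bytes(b"example contents")
    second.write_bytes(b"example contents")
    print(cache_style_path(first) == cache_style_path(second))  # True</code></pre></div>
<p>Because the path depends only on the contents, renaming a file in your
workspace would not change where its data lives in the cache or on the remote.</p>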
<p>To get some more intuition about this, check out some of our
<a href="https://dvc.org/doc/user-guide/dvc-internals" target="_blank" rel="nofollow noopener noreferrer">docs</a> about how DVC organizes
files.</p>https://dvc.org/blog/april-20-dvc-heartbeathttps://dvc.org/blog/april-20-dvc-heartbeatMon, 06 Apr 2020 00:00:00 GMT<p>Welcome to the April Heartbeat, our
<a href="https://dvc.org/blog/tags/heartbeat" target="_blank" rel="nofollow noopener noreferrer">monthly roundup of cool happenings</a>, good
reads and other bright spots in our community.</p>
<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><strong>Adapting to the pandemic.</strong> Although the world seems different than when we
posted last month, the DVC community is steady and strong. As a predominantly
distributed company, we've been developing our infrastructure for remote work
from the get-go. It isn't always <em>easy</em> to schedule an all-hands meeting across
9 time zones but we make it work. This experience has prepared us well for the
COVID-19 pandemic: although there are new challenges (like caring for families
while working from home) we've been able to weather the transition to fully
remote work relatively well.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 605px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/6203b39de7f66012048047cb492129ac/03346/laptop_on_boat.jpg" alt="laptop on boat" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Before social distancing
started, DVC technical writer Jorge Orpinel Pérez worked from a canoe. Check
out more photos from his workations
<a href="https://www.instagram.com/workationer/" target="_blank" rel="nofollow noopener noreferrer">on Instagram</a>.</em></p>
<p><strong>DVC sponsors DivOps.</strong> In a time when many conferences are going remote out of
necessity, we were fortunate to be part of an <em>intentionally</em> remote conference
this month! We sponsored <a href="https://divops.org/" target="_blank" rel="nofollow noopener noreferrer">DivOps</a>, a fully-online meeting
led by women in DevOps. The DivOps lineup included speakers from GitHub,
Dropbox, Gremlin and more. DVC data scientist Elle (that's me!) gave a
ten-minute talk about MLOps and CI/CD, so
<a href="https://dvc.org/blog/reimagining-devops-video" target="_blank" rel="nofollow noopener noreferrer">please check out the video</a>.
Another very relevant talk was from Anna Petrovicheva, CEO of
<a href="http://xperience.ai/" target="_blank" rel="nofollow noopener noreferrer">Xperience AI</a>; Anna
<a href="https://youtu.be/8nwpCQufeE0" target="_blank" rel="nofollow noopener noreferrer">spoke about her team's development workflow for deep learning projects</a>
and gave a clear overview of how they use DVC.</p>
<p><strong>DVC on the airwaves.</strong> In early March, Elle was interviewed on an episode of
<a href="https://www.interviewquery.com/tag/podcast/" target="_blank" rel="nofollow noopener noreferrer">The Data Stream podcast</a> about a
DVC data science project,
<a href="https://dvc.org/blog/a-public-reddit-dataset" target="_blank" rel="nofollow noopener noreferrer">building a public dataset of posts</a>
from the "Am I the Asshole?" subreddit.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.interviewquery.com/blog-who-is-the-asshole/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">The Data Stream #3 - Who is the A-hole? With Elle</h4>
<div class="elp-description">Ever wonder if it's possible to train a model to discover whether your friends are assholes or not? Today Elle comes on the show to talk about her project building a classifier to predict the results from reddit's hottest advice community: Am I the Asshole (or AITA for short).</div>
<div class="elp-link">interviewquery.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-04-06/data_stream-1b5639c7a93df053471157dd03ff1852.png" alt="The Data Stream #3 - Who is the A-hole? With Elle">
</div>
</a>
</section>
<p></p>
<h2 id="new-releases" style="position:relative;">New releases<a href="#new-releases" aria-label="new releases permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This month, DVC has
<a href="https://github.com/iterative/dvc/releases" target="_blank" rel="nofollow noopener noreferrer">released some new features</a> and
updates:</p>
<ul>
<li>Did you know you can use Google Drive for remote storage with DVC? We've been
hard at work delivering the best performance with Google Drive and are
thrilled to invite users to try it out. Brand new
<a href="https://dvc.org/doc/user-guide/setup-google-drive-remote#setup-a-google-drive-dvc-remote" target="_blank" rel="nofollow noopener noreferrer">docs</a>
explain how to get started.</li>
<li>We're introducing the <code>metrics diff</code> functionality, which lets you compare
metrics from different commits side-by-side
(<a href="https://dvc.org/doc/command-reference/metrics/diff" target="_blank" rel="nofollow noopener noreferrer">check out the docs</a> to
learn more)</li>
<li>Windows users, we are here for you. Contributor
<a href="https://github.com/rxxg" target="_blank" rel="nofollow noopener noreferrer">rxxg</a> helped us get better performance on copy
operations in Windows.</li>
</ul>
<h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><strong>DVC and R working together.</strong> One of our favorite blogs this month came from
Marcel Ribeiro-Dantas, a developer and PhD student at the
<a href="https://institut-curie.org/" target="_blank" rel="nofollow noopener noreferrer">Institut Curie</a>. Marcel wrote about using DVC to
manage projects in R, particularly defining and versioning pipelines of data
processing and analysis that can be reproduced easily. While DVC is language
agnostic, much of our user content has been Python-centric, so it's exciting to
see a detailed post for the R-using data scientist (for more about R with DVC,
see
<a href="https://dvc.org/blog/r-code-and-reproducible-model-development-with-dvc" target="_blank" rel="nofollow noopener noreferrer">Marija Ilić's post</a>)!</p>
<p>
</p><section class="elp-content-holder">
<a href="https://mribeirodantas.xyz/blog/index.php/2020/03/05/r-dvc-and-rmarkdown/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Manage your Data Science Project in R</h4>
<div class="elp-description">A simple project tutorial with R/RMarkdown, Packrat, Git, and DVC.</div>
<div class="elp-link">mribeirodantas.xyz</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-04-06/marcel-6cb50d09d344473e4dc1f4b30ceff3d1.jpeg" alt="Manage your Data Science Project in R">
</div>
</a>
</section>
<p></p>
<p>Also, Marcel recently gave an interview on
<a href="https://medium.com/data-hackers/health-data-e-o-coronav%C3%ADrus-data-hackers-podcast-22-2b059d460cb1" target="_blank" rel="nofollow noopener noreferrer">The Data Hackers Podcast</a>,
a Portuguese-language show. Listen for a shout-out about DVC!</p>
<p><strong>DVC is in another book!</strong> Last month we reported that DVC is part of a Packt
book,
<a href="https://www.packtpub.com/programming/learn-python-by-building-data-science-applications" target="_blank" rel="nofollow noopener noreferrer">"Learn Python by Building Data Science Applications"</a>.
This month, DVC got a mention in a just-released O'Reilly book,
<a href="https://www.oreilly.com/library/view/building-machine-learning/9781492053187/" target="_blank" rel="nofollow noopener noreferrer">"Building Machine Learning Pipelines"</a>
by Hannes Hapke and Catherine Nelson.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.oreilly.com/library/view/building-machine-learning/9781492053187/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Building Machine Learning Pipelines</h4>
<div class="elp-description">Automating Model Life Cycles with TensorFlow</div>
<div class="elp-link">oreilly.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-04-06/oreilly-966c0911a0b738fecd45cd73637ed540.jpeg" alt="Building Machine Learning Pipelines">
</div>
</a>
</section>
<p></p>
<p><strong>Some more links we like.</strong> Here are a few other discussions that have caught
our attention.</p>
<ul>
<li>
<p><strong>MLOps can be fun.</strong> Jeroen France's blog, "MLOps: Not as boring as it
sounds!", reads like a "coming of age" story about embracing engineering as a
data scientist. It's part motivational, part tutorial, and definitely worth a
read. Here's a sample:</p>
<blockquote>
<p>No-one wants to baby-sit, maintain, and troubleshoot their own models once
they are in production. Every data scientist secretly hopes they can pawn
that job off to an engineering team, or maybe an intern, right? Well, in
fact MLOps is going to make your data science life a lot better.</p>
</blockquote>
</li>
<li>
<p><strong>Leveling up your Jupyter notebooks.</strong> In a series called
<a href="https://ljvmiranda921.github.io/notebook/2020/03/16/jupyter-notebooks-in-2020-part-2/" target="_blank" rel="nofollow noopener noreferrer">"How to Use Jupyter Notebooks in 2020"</a>,
Lj Miranda discusses how to use Jupyter Notebooks in a mature software
development workflow. He makes several recommendations for tools, including
DVC.</p>
</li>
<li>
<p><strong>Reddit discussion about CI/CD.</strong> When we shared our DivOps conference
presentation on Reddit, a
<a href="https://www.reddit.com/r/MachineLearning/comments/fshh9p/p_a_talk_about_adapting_cicd_systems_for_ml_full/" target="_blank" rel="nofollow noopener noreferrer">great discussion followed</a>.
We chatted about how CI/CD might work for data scientists, who often begin a
project with a phase of rapid exploration, and what version control for ML
could look like without Git.</p>
</li>
<li>
<p><strong>Smashing the data monolith.</strong> Engineer Juan López López wrote a blog called
<a href="https://medium.com/packlinkeng/a-complete-guide-about-how-to-break-the-data-monolith-caa2ab2d01f6" target="_blank" rel="nofollow noopener noreferrer">"A complete guide about how to break the data monolith"</a>,
which is a neat manifesto about treating infrastructure <em>and</em> data as code.
It's got nice coverage of DVC, code examples, and some deeply enjoyable
artwork.</p>
</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 527px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f6bccc332f6efbef0e58e9e349ad59ab/03346/monolith.jpg" alt="monolith" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>From Juan López López's
<a href="https://medium.com/packlinkeng/a-complete-guide-about-how-to-break-the-data-monolith-caa2ab2d01f6" target="_blank" rel="nofollow noopener noreferrer">blog</a>.</em></p>
<p>Thanks for reading. As always, let us know what you're making with DVC and what
links are catching your interest in the blog comments, on
<a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a>, and our
<a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>. Be safe and be in touch!</p>https://dvc.org/blog/reimagining-devops-videohttps://dvc.org/blog/reimagining-devops-videoTue, 31 Mar 2020 00:00:00 GMT<p>Last week, DVC was part of <a href="https://divops.org/" target="_blank" rel="nofollow noopener noreferrer">DivOps</a>, a fully remote
conference led by women in DevOps. DevOps, to the newly anointed, is a
discipline bringing together strong software engineering practices with speedy
development cycles. As machine learning is finding its way into just about
<em>every</em> area of research and development, we're going to need to come up with
some conventions and tools for integrating machine learning and big data with
software development. This growing field is called
<a href="https://towardsdatascience.com/the-rise-of-the-term-mlops-3b14d5bd1bdb" target="_blank" rel="nofollow noopener noreferrer">MLOps</a>.</p>
<p>I gave a lightning talk about how we'll have to rethink our software development
practices in the age of machine learning. It's got a focus on
<a href="https://martinfowler.com/articles/cd4ml.html" target="_blank" rel="nofollow noopener noreferrer">CI/CD</a>, a way of structuring
workflows that we think can streamline exchanges between data scientists and
software engineers. And, it's got fuzzy animals. Check it out here:</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/0MDrZpO_7Q4?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>If you liked this, you'll also want to check out the next talk in the DivOps
playlist by
<a href="https://www.linkedin.com/in/anna-petrovicheva-44b24673/" target="_blank" rel="nofollow noopener noreferrer">Anna Petrovicheva</a>,
Founder and CEO of Xperience AI. Anna's talk goes deeper into developing best
practices for software engineering with deep learning.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/8nwpCQufeE0?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>All the talks from DivOps are
<a href="https://www.youtube.com/playlist?list=PLVeJCYrrCemgbA1cWYn3qzdgba20xJS8V" target="_blank" rel="nofollow noopener noreferrer">available online now</a>,
so please check out the YouTube channel. And stay tuned on our blog for more
CI/CD discussions coming soon…</p>https://dvc.org/blog/march-20-community-gemshttps://dvc.org/blog/march-20-community-gemsThu, 12 Mar 2020 00:00:00 GMT<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Here are some Q&amp;As from our Discord channel that we think are worth sharing.</p>
<h3 id="q-i-have-several-simulations-organized-with-git-tags-i-know-i-can-compare-the-metrics-with-dvc-metrics-diff-a_rev-b_rev-substituting-hashes-branches-or-tags-for-a_rev-and-b_rev-but-what-if-i-wanted-to-see-the-metrics-for-a-list-of-tags" style="position:relative;">Q: I have several simulations organized with Git tags. I know I can compare the metrics with <a href="https://dvc.org/doc/command-reference/metrics/diff"><code>dvc metrics diff [a_rev] [b_rev]</code></a>, substituting hashes, branches, or tags for [a_rev] and [b_rev]. <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/687634347104403528" target="_blank" rel="nofollow noopener noreferrer">But what if I wanted to see the metrics for a list of tags?</a><a href="#q-i-have-several-simulations-organized-with-git-tags-i-know-i-can-compare-the-metrics-with-dvc-metrics-diff-a_rev-b_rev-substituting-hashes-branches-or-tags-for-a_rev-and-b_rev-but-what-if-i-wanted-to-see-the-metrics-for-a-list-of-tags" aria-label="q i have several simulations organized with git tags i know i can compare the metrics with dvc metrics diff a_rev b_rev substituting hashes branches or tags for a_rev and b_rev but what if i wanted to see the metrics for a list of tags permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC has a built-in option for this! You can use <a href="https://dvc.org/doc/command-reference/metrics/show"><code>dvc metrics show</code></a> with the
<code>-T</code> option:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc metrics show</span> <span class="token parameter variable">-T</span></span></code></pre></div>
<p>to list the metrics for all tagged experiments.</p>
<p>Also, we have a couple of relevant discussions going on in our GitHub repo about
<a href="https://github.com/iterative/dvc/issues/2799" target="_blank" rel="nofollow noopener noreferrer">handling experiments</a> and
<a href="https://github.com/iterative/dvc/issues/3393" target="_blank" rel="nofollow noopener noreferrer">hyperparameter tuning</a>. Feel free
to join the discussion and let us know what kind of support would help you most.</p>
<h3 id="q-is-there-a-recommended-way-to-save-metadata-about-the-data-in-a-dvc-file-in-particular-id-like-to-save-summary-statistics-eg-mean-minimum-and-maximum-about-my-data" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/685105104340386037" target="_blank" rel="nofollow noopener noreferrer">Is there a recommended way to save metadata about the data in a <code>.dvc</code> file?</a> In particular, I'd like to save summary statistics (e.g., mean, minimum, and maximum) about my data.<a href="#q-is-there-a-recommended-way-to-save-metadata-about-the-data-in-a-dvc-file-in-particular-id-like-to-save-summary-statistics-eg-mean-minimum-and-maximum-about-my-data" aria-label="q is there a recommended way to save metadata about the data in a dvc file in particular id like to save summary statistics eg mean minimum and maximum about my data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>One simple way to keep metadata in a <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file is by using the <code>meta</code> field.
Each <code>meta</code> entry is a <code>key:value</code> pair (for example, <code>name: Jean-Luc</code>). The
<code>meta</code> field can be manually added or written programmatically, but note that if
the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file is overwritten (perhaps by <code>dvc run</code>, <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>, or
<a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a>) these values will not be preserved. You can read more about this
<a href="https://dvc.org/doc/user-guide/project-structure" target="_blank" rel="nofollow noopener noreferrer">in our docs</a>.</p>
<p>Another approach would be to track the statistics of your dataset in a metric
file, just as you might track performance metrics of a model. For a tutorial on
using DVC metrics please
<a href="https://dvc.org/doc/command-reference/metrics" target="_blank" rel="nofollow noopener noreferrer">see our docs</a>.</p>
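<p>As a minimal sketch of the second approach, the snippet below computes the
mean, minimum, and maximum of a list of values and writes them to a JSON file.
The function name <code>write_summary_metrics</code>, the file name
<code>data_summary.json</code>, and the sample values are all hypothetical; how
the file is then registered as a metric depends on your DVC version.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import json
import statistics
import tempfile
from pathlib import Path


def write_summary_metrics(values, out_path):
    """Write mean, minimum, and maximum of `values` to a JSON file.

    Hypothetical helper for illustration; not part of DVC's API.
    """
    summary = {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }
    Path(out_path).write_text(json.dumps(summary, indent=2))
    return summary


# Write the summary to a throwaway directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    metrics_file = Path(tmp) / "data_summary.json"
    print(write_summary_metrics([2.0, 4.0, 9.0], metrics_file))
    # {'mean': 5.0, 'min': 2.0, 'max': 9.0}</code></pre></div>
<p>Once a file like this is tracked as a DVC metric file, the statistics can be
listed and compared across revisions in the same way as model performance
metrics.</p>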
<h3 id="q-my-team-has-been-using-dvc-in-production-when-we-upgraded-from-dvc-version-0710-we-started-getting-an-error-message-error-unexpected-error---my-folder-is-not-a-git-repository-whats-going-on" style="position:relative;">Q: My team has been using DVC in production. When we upgraded from DVC version 0.71.0, we started getting an error message: <code>ERROR: unexpected error - /my-folder is not a git repository</code>. <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/687403454989467650" target="_blank" rel="nofollow noopener noreferrer">What's going on?</a><a href="#q-my-team-has-been-using-dvc-in-production-when-we-upgraded-from-dvc-version-0710-we-started-getting-an-error-message-error-unexpected-error---my-folder-is-not-a-git-repository-whats-going-on" aria-label="q my team has been using dvc in production when we upgraded from dvc version 0710 we started getting an error message error unexpected error my folder is not a git repository whats going on permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is a consequence of new support we've added for monorepos with the
<a href="https://dvc.org/doc/command-reference/init#--subdir"><code>dvc init --subdir</code></a> functionality
(<a href="https://dvc.org/doc/command-reference/init#init" target="_blank" rel="nofollow noopener noreferrer">see more here</a>), which allows
multiple DVC projects to live within a single Git repository. Now, if a DVC
repository doesn't contain a <code>.git</code> directory, DVC expects the <code>no_scm</code> flag to
be present in <code>.dvc/config</code> and raises an error if not. For example, one of our
users reported this when using DVC to pull files into a Docker container that
didn't have Git initialized (for more about using DVC without Git,
<a href="https://dvc.org/doc/command-reference/init#initializing-dvc-without-git" target="_blank" rel="nofollow noopener noreferrer">see our docs</a>).</p>
<p>You can fix this by running <a href="https://dvc.org/doc/command-reference/config"><code>dvc config core.no_scm true</code></a> (you could include
this command in the script that creates Docker images). Alternatively, you could
include <code>.git</code> in your Docker container, but this is not advisable for all
situations.</p>
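<p>As a rough illustration of that fix, here is a minimal Python sketch of how an image-build script might decide whether the flag is needed. The helper names <code>no_scm_command</code> and <code>ensure_no_scm</code> are our own for this example (they are not part of DVC), and the sketch assumes the DVC project sits at the top of the directory being checked, so it does not cover <code>--subdir</code> projects whose <code>.git</code> lives in a parent directory.</p>

```python
from pathlib import Path
import subprocess


def no_scm_command(project_dir):
    """Return the DVC command to run when project_dir has no .git entry.

    Returns None when .git is present, because no config change is needed.
    """
    if (Path(project_dir) / ".git").exists():
        return None
    return ["dvc", "config", "core.no_scm", "true"]


def ensure_no_scm(project_dir):
    """Run the command from no_scm_command (if any) inside project_dir."""
    command = no_scm_command(project_dir)
    if command is not None:
        # Assumes the dvc executable is on PATH and .dvc/ already exists.
        subprocess.run(command, cwd=project_dir, check=True)
    return command
```

<p>Calling <code>ensure_no_scm("/my-folder")</code> from a build script would then set the flag only in containers that were created without Git.</p>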
<p>We are currently working to
<a href="https://github.com/iterative/dvc/issues/3474" target="_blank" rel="nofollow noopener noreferrer">add graceful error-handling</a> for
this particular issue, so stay tuned.</p>
<h3 id="q-is-there-a-way-to-force-the-pipeline-to-rerun-even-if-its-dependencies-havent-changed" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/687422002822381609" target="_blank" rel="nofollow noopener noreferrer">Is there a way to force the pipeline to rerun, even if its dependencies haven't changed?</a><a href="#q-is-there-a-way-to-force-the-pipeline-to-rerun-even-if-its-dependencies-havent-changed" aria-label="q is there a way to force the pipeline to rerun even if its dependencies havent changed permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes, <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> has a flag for this. You can use the <code>-f</code> or
<code>--force</code> flag to reproduce the pipeline even when no changes in the
dependencies (for example, a training dataset tracked by DVC) have been found. So
if you had a hypothetical DVC pipeline whose final stage was <code>deploy.dvc</code>,
you could run <a href="https://dvc.org/doc/command-reference/repro#-f"><code>dvc repro -f deploy.dvc</code></a> to rerun the whole pipeline.</p>
<h3 id="q-whats-the-best-way-to-organize-dvc-repositories-if-i-have-several-training-datasets-shared-by-several-projects-some-projects-use-only-one-dataset-while-other-use-several-can-one-project-have-dvc-files-corresponding-to-different-remotes" style="position:relative;">Q: What's the best way to organize DVC repositories if I have several training datasets shared by several projects? Some projects use only one dataset while other use several. <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/670664813973864449" target="_blank" rel="nofollow noopener noreferrer">Can one project have <code>.dvc</code> files corresponding to different remotes?</a><a href="#q-whats-the-best-way-to-organize-dvc-repositories-if-i-have-several-training-datasets-shared-by-several-projects-some-projects-use-only-one-dataset-while-other-use-several-can-one-project-have-dvc-files-corresponding-to-different-remotes" aria-label="q whats the best way to organize dvc repositories if i have several training datasets shared by several projects some projects use only one dataset while other use several can one project have dvc files corresponding to different remotes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes, one project directory can contain datasets from several different DVC
remotes. Specifically, the commands <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> and <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a> emulate
the experience of using a package manager for grabbing datasets from external
sources. You can use <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> or <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a> to access any number of datasets
that are dependencies in a given project. For more on this,
<a href="https://dvc.org/doc/use-cases/data-registries" target="_blank" rel="nofollow noopener noreferrer">see our tutorial on data registries</a>.</p>
<h3 id="q-what-are-the-risks-of-using-dvc-on-confidential-data" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/689848196473684024" target="_blank" rel="nofollow noopener noreferrer">What are the risks of using DVC on confidential data?</a><a href="#q-what-are-the-risks-of-using-dvc-on-confidential-data" aria-label="q what are the risks of using dvc on confidential data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC doesn't collect any information about your data (or code, or models, for
that matter). You may have noticed that DVC
<a href="https://dvc.org/doc/user-guide/analytics" target="_blank" rel="nofollow noopener noreferrer">collects Anonymized Usage Analytics</a>,
which users may
<a href="https://dvc.org/doc/user-guide/analytics#opting-out" target="_blank" rel="nofollow noopener noreferrer">opt out of</a>. The data we
collect is extremely limited and anonymized, as it is collected mainly for the
purpose of prioritizing bugs and feature development based on DVC usage. For
example, we collect info about your operating system, DVC version, and
installation method (the
<a href="https://dvc.org/doc/user-guide/analytics#what" target="_blank" rel="nofollow noopener noreferrer">complete list of collected features is here</a>).</p>
<p>Many of our users work with sensitive or private data, and we've developed DVC
with such scenarios in mind from day one.</p>
<h3 id="q-can-you-suggest-a-reference-architecture-for-using-dvc-as-part-of-mlops" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/683890642631524392" target="_blank" rel="nofollow noopener noreferrer">Can you suggest a reference architecture for using DVC as part of MLOps?</a><a href="#q-can-you-suggest-a-reference-architecture-for-using-dvc-as-part-of-mlops" aria-label="q can you suggest a reference architecture for using dvc as part of mlops permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Increasingly, DVC is being used not just to version and manage machine
learning projects, but as part of MLOps, <em>practices for combining data science
and software engineering</em>. As MLOps is a fairly new discipline, standards and
references aren't yet solidified. So while there isn't (<em>yet</em>) a standard recipe
for using DVC in MLOps projects, we can point you to a few architectures we
like, and which have been reported in sufficient detail to recreate.</p>
<p>First, DVC can be used to detect events (such as dataset changes) in a CI/CD
system that traditional version control systems might not be able to. An
excellent and thorough
<a href="https://martinfowler.com/articles/cd4ml.html" target="_blank" rel="nofollow noopener noreferrer">blog by Danilo Sato et al.</a>
explores using DVC in this way, as part of a CI/CD system that retrains a model
automatically when changes in the dataset are detected.</p>
<p>Second, DVC can be used to support model training on cloud GPUs, particularly as
a tool for pushing and pulling files (such as datasets and trained models)
between cloud computing instances, DVC repositories, and other environments.
This architecture was the subject of a
<a href="https://blog.codecentric.de/en/2020/01/remote-training-gitlab-ci-dvc/" target="_blank" rel="nofollow noopener noreferrer">recent blog by Marcel Mikl and Bert Besser</a>.
Their report describes the cloud computing setup and continuous integration
pipeline quite well.</p>
<p>If you develop your own architecture for using DVC in MLOps, please keep us
posted. We'll be eager to learn from your experience. Also, keep an eye on our
blog in the next few months. We're rolling out some new tools with a focus on
MLOps!</p>https://dvc.org/blog/march-20-dvc-heartbeathttps://dvc.org/blog/march-20-dvc-heartbeatWed, 11 Mar 2020 00:00:00 GMT<p>Welcome to the March Heartbeat! Here are some highlights from our team and
community this past month:</p>
<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><strong>DVC is STILL growing!</strong> In February, Senior Software Engineer
<a href="https://www.linkedin.com/in/jiojiajiu/" target="_blank" rel="nofollow noopener noreferrer">Guro Bokum</a> joined DVC. He's previously
contributed to the core DVC code base and brings several years of full-stack
engineering expertise to the team. Welcome, Guro!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 500px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/40e5aae472d45aa14f9f6daee17ff183/39600/hi_guro.png" alt="hi guro" title="Imgx667" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Welcome, Guro!</em></p>
<p><strong>New feature alert.</strong> We've received many requests for
<a href="https://en.wikipedia.org/wiki/Monorepo" target="_blank" rel="nofollow noopener noreferrer">monorepo</a> support in DVC. As of DVC
<a href="https://github.com/iterative/dvc/releases" target="_blank" rel="nofollow noopener noreferrer">release 0.87.0</a>, users can version
data science projects within a monorepo! The new <a href="https://dvc.org/doc/command-reference/init#--subdir"><code>dvc init --subdir</code></a>
functionality is designed to allow multiple DVC repositories within a single Git
repository. Don't forget to upgrade and
<a href="https://dvc.org/doc/command-reference/init" target="_blank" rel="nofollow noopener noreferrer">check out the latest docs</a>.</p>
<h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>First, there's an intriguing
<a href="https://github.com/iterative/dvc/issues/3393" target="_blank" rel="nofollow noopener noreferrer">discussion evolving in the DVC repo</a>
about how machine learning hyperparameters (such as learning rate, number of
layers in a deep neural network, etc.) can be tracked. Right now,
hyperparameters are tracked as source code (i.e., with Git). Could we use some
kind of abstraction to separate hyperparameters from source code in a
DVC-managed project? Read on and feel free to jump into this discussion, largely
helmed by software developer and DVC contributor
<a href="http://elgehelge.github.io/" target="_blank" rel="nofollow noopener noreferrer">Helge Munk Jacobsen</a>.</p>
<p>Another discussion we appreciated happened on Twitter:</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">We give tools like Slack and Zoom a lot of credit for making remote work possible, and I think Git and every hosted DVC system should equally get the same credit. Imagine life for a second without version control. Think about that.</p>— Celestine (@cyberomin) <a href="https://twitter.com/cyberomin/status/1223651811082559488?ref_src=twsrc%5Etfw">February 1, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Thanks, <a href="https://twitter.com/cyberomin" target="_blank" rel="nofollow noopener noreferrer">@cyberomin</a>!</p>
<p>Elsewhere on the internet, DVC made the cut in a much-shared blog,
<a href="https://medium.com/@squarecog/five-interesting-data-engineering-projects-48ffb9c9c501" target="_blank" rel="nofollow noopener noreferrer">Five Interesting Data Engineering Projects</a>
by <a href="https://twitter.com/squarecog" target="_blank" rel="nofollow noopener noreferrer">Dmitry Ryaboy</a> (VP of Engineering at biotech
startup Zymergen, and formerly Twitter). Dmitry wrote:</p>
<blockquote>
<p>To be honest, I’m a bit of a skeptic on “git for data” and various automated
data / workflow versioning schemes: various approaches I’ve seen in the past
were either too partial to be useful, or required too drastic a change in how
data scientists worked to get a realistic chance at adoption. So I ignored, or
even explicitly avoided, checking DVC out as the buzz grew. I’ve finally
checked it out and… it looks like maybe this has legs? Metrics tied to
branches / versions are a great feature. Tying the idea of git-like branches
to training multiple models makes the value prop clear. The implementation,
using Git for code and datafile index storage, while leveraging scalable data
stores for data, and trying to reduce overall storage cost by being clever
about reuse, looks sane. A lot of what they have to say in
<a href="https://dvc.org/doc/understanding-dvc" target="_blank" rel="nofollow noopener noreferrer">https://dvc.org/doc/understanding-dvc</a> rings true.</p>
</blockquote>
<p>Check out the full blog here:</p>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/@squarecog/five-interesting-data-engineering-projects-48ffb9c9c501" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Five Interesting Data Engineering Projects</h4>
<div class="elp-description">There’s been a lot of activity in the data engineering world lately, and a ton of really interesting projects and ideas have come on the scene in the past few years. This post is an introduction to (just) five that I think a data engineer who wants to stay current needs to know about.</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-03-11/dmitry_r-dbf24d69ef2c84c69371729b25b99e50.jpg" alt="Five Interesting Data Engineering Projects">
</div>
</a>
</section>
<p></p>
<p>One of the areas that DVC is growing into is continuous integration and
continuous deployment (CI/CD), a part of the nascent field of MLOps. Naturally,
we were thrilled to discover that CI/CD with DVC is taught in a new Packt book,
<a href="https://www.packtpub.com/programming/learn-python-by-building-data-science-applications" target="_blank" rel="nofollow noopener noreferrer">"Learn Python by Building Data Science Applications"</a>
by David Katz and Philipp Kats.</p>
<p>In the authors' words, the goal of this book is to teach data scientists and
engineers "not only how to implement Python in data science projects, but also
how to maintain and design them to meet high programming standards." Needless to
say, we are considering starting a book club. Grab a copy here:</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.packtpub.com/programming/learn-python-by-building-data-science-applications" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Learn Python by Building Data Science Applications</h4>
<div class="elp-description">Understand the constructs of the Python programming language and use them to build data science projects</div>
<div class="elp-link">packtpub.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-03-11/packt-d8568693d8daf46cb46c15e3d6f1103d.jpeg" alt="Learn Python by Building Data Science Applications">
</div>
</a>
</section>
<p></p>
<p>Last year in Mexico, DVC contributor Ramón Valles gave a talk about reproducible
machine learning workflows at Data Day Monterrey—and
<a href="https://www.youtube.com/watch?v=tAxG-n20Di4" target="_blank" rel="nofollow noopener noreferrer">a video of his presentation</a> is
now online! In this Spanish-language talk, Ramón gives a thorough look at DVC,
particularly building pipelines for reproducible ML.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.youtube.com/watch?v=tAxG-n20Di4" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Experimentación ágil de machine learning con DVC</h4>
<div class="elp-description">Data Day Monterrey '19</div>
<div class="elp-link">youtube.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-03-11/dataday_mr-80082fa35e146d3cc5e7ff0afdcb8857.png" alt="Experimentación ágil de machine learning con DVC">
</div>
</a>
</section>
<p></p>
<p>Finally, DVC data scientist Elle (that's me!) released a new public dataset of
posts from the Reddit forum
<a href="https://reddit.com/r/amitheasshole" target="_blank" rel="nofollow noopener noreferrer">r/AmItheAsshole</a>, and reported some
preliminary analyses. We're inviting anyone and everyone to play with the data,
make some hypotheses and share their findings. Check it out here:</p>
<p>
</p><section class="elp-content-holder">
<a href="https://blog.dvc.org/a-public-reddit-dataset" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">AITA for making this? A public dataset of Reddit posts about moral dilemmas</h4>
<div class="elp-description">Delve into an open natural language dataset of posts about moral dilemmas from r/AmItheAsshole. Use this dataset for whatever you want- here's how to get it and start playing.</div>
<div class="elp-link">blog.dvc.org</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-03-11/aita_sm-bb0ee7157daa11e5246496403d0f6e16.png" alt="AITA for making this? A public dataset of Reddit posts about moral dilemmas">
</div>
</a>
</section>
<p></p>
<p>That's all for now—thanks for reading, and be in touch on our
<a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">GitHub</a>,
<a href="https://twitter.com/dvcorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a>, and
<a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>.</p>https://dvc.org/blog/february-20-community-gemshttps://dvc.org/blog/february-20-community-gemsWed, 19 Feb 2020 00:00:00 GMT<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Welcome to the February roundup of useful, intriguing, and good-to-know
discussions going on with DVC users and developers. Let's dive right in with
some questions from our Discord channel.</p>
<h3 id="q-if-i-have-multiple-outputs-from-a-dvc-pipeline-and-only-want-to-checkout-one-what-command-would-i-run" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/670233820326264843" target="_blank" rel="nofollow noopener noreferrer">If I have multiple outputs from a DVC pipeline and only want to checkout one, what command would I run?</a><a href="#q-if-i-have-multiple-outputs-from-a-dvc-pipeline-and-only-want-to-checkout-one-what-command-would-i-run" aria-label="q if i have multiple outputs from a dvc pipeline and only want to checkout one what command would i run permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>By default, <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a> is written for a
<a href="https://dvc.org/doc/command-reference/checkout" target="_blank" rel="nofollow noopener noreferrer">Git-like experience</a>, meaning
that it will sync your local workspace with all the model files, dependencies,
and outputs specified by a project's <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files. If you only want to access
one artifact from the project, you can do this with
<a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout &lt;path to file&gt;</code></a>. This will deliver the specified file to your
workspace.</p>
<p>If you're interested in sharing specific artifacts (like data files or model
binaries) with other users, you might also consider <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a> and <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a>.
These commands are ideal for downloading a single file (or a few files) to the
local workspace, instead of the whole project.</p>
<h3 id="q-i-have-a-complicated-use-case-were-trying-to-set-up-a-system-where-users-act-as-data-scientists-theyd-select-data-which-would-be-cleanedtransformed-in-the-backend-and-experiment-with-model-hyperparameters-until-theyre-happy-with-the-model-result-then-they-can-save-the-model-including-artifacts-like-the-input-data-used-metrics-and-binary-model-file-placing-the-experiment-under-version-control-later-they-can-load-the-model-again-and-select-new-input-data-from-our-database-change-parameters-and-update-it-there-might-be-hundreds-of-separate-models-can-dvc-do-this" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/668773484549242890" target="_blank" rel="nofollow noopener noreferrer">I have a complicated use case.</a> We're trying to set up a system where users act as data scientists. They'd select data, which would be cleaned/transformed in the backend, and experiment with model hyperparameters until they're happy with the model result. Then they can "save" the model, including artifacts like the input data used, metrics, and binary model file, placing the experiment under version control. Later they can "load" the model again and select new input data from our database, change parameters, and "update it". There might be hundreds of separate models. 
Can DVC do this?<a href="#q-i-have-a-complicated-use-case-were-trying-to-set-up-a-system-where-users-act-as-data-scientists-theyd-select-data-which-would-be-cleanedtransformed-in-the-backend-and-experiment-with-model-hyperparameters-until-theyre-happy-with-the-model-result-then-they-can-save-the-model-including-artifacts-like-the-input-data-used-metrics-and-binary-model-file-placing-the-experiment-under-version-control-later-they-can-load-the-model-again-and-select-new-input-data-from-our-database-change-parameters-and-update-it-there-might-be-hundreds-of-separate-models-can-dvc-do-this" aria-label="q i have a complicated use case were trying to set up a system where users act as data scientists theyd select data which would be cleanedtransformed in the backend and experiment with model hyperparameters until theyre happy with the model result then they can save the model including artifacts like the input data used metrics and binary model file placing the experiment under version control later they can load the model again and select new input data from our database change parameters and update it there might be hundreds of separate models can dvc do this permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Most of this functionality is supported by DVC already. We recommend
<a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> as a method for giving users access to data in a repository (and
also check out our
<a href="https://dvc.org/doc/use-cases/data-registries" target="_blank" rel="nofollow noopener noreferrer">tutorial on data registries</a>).
For pre-processing data,
<a href="https://dvc.org/doc/get-started/pipeline" target="_blank" rel="nofollow noopener noreferrer">DVC pipelines</a> can automate a
procedure for transforming and cleaning inputs (i.e., you can use bash scripts
to <code>dvc run</code> the pipeline whenever a user selects a dataset). Saving the
workspace after experimentation, including model files, metrics, and outputs, is
a core function of DVC (see the <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> and <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> commands). We also have a
<a href="https://dvc.org/doc/use-cases/data-registries#programatic-reusability-of-dvc-data" target="_blank" rel="nofollow noopener noreferrer">Python API</a>
so users can load artifacts like datasets and model files into their local
Python session. When they're done experimenting, they can <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> and
<a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> their progress. Users can later "pull" a saved workspace and all
associated files using <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a>.</p>
<p>As for how to organize hundreds of separate experiments, we're still evolving
our strategy and best-practice recommendations. It's conceivable that each
experiment could be carried out and saved on a separate branch of a project
repository. Our thoughts about structuring version control around architecture
search and hyperparameter tuning could fill up a whole blog (and probably will
in the not-so-distant future); check out one of our
<a href="https://github.com/iterative/dvc/issues/2799" target="_blank" rel="nofollow noopener noreferrer">recent conversation threads</a> if
you'd like to see where we're currently at. And please let us know how your use
case goes—at this stage, we'd love to hear what works for you.</p>
<h3 id="q-whats-the-difference-between-config-and-configlocal-files-is-it-safe-to-do-git-commit-without-including-my-config-file" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/666708671333400599" target="_blank" rel="nofollow noopener noreferrer">What's the difference</a> between <code>config</code> and <code>config.local</code> files? Is it safe to do git commit without including my config file?<a href="#q-whats-the-difference-between-config-and-configlocal-files-is-it-safe-to-do-git-commit-without-including-my-config-file" aria-label="q whats the difference between config and configlocal files is it safe to do git commit without including my config file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There are indeed two kinds of config files you might come across, both in your project
directory's <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> folder, one of which also appears in its <code>.gitignore</code> file. The key difference is that
<code>config</code> is intended to be committed to Git, while <code>config.local</code> is not. You'd
use <code>config.local</code> to store sensitive information (like personal credentials for
SSH or another kind of authenticated storage) or settings specific to your local
environment—things you wouldn't want to push to a GitHub repo. DVC only modifies
<code>config.local</code> when you explicitly use the <code>--local</code> flag in the <a href="https://dvc.org/doc/command-reference/config"><code>dvc config</code></a> or
<a href="https://dvc.org/doc/command-reference/remote"><code>dvc remote *</code></a> commands, so outside of these cases you shouldn't have to worry
about it.</p>
<p>As for using <code>git commit</code> without the <code>config</code> file, it is safe. <em>But</em> you
should check if there are any settings in <code>config.local</code> that you actually want
to save to <code>config</code>. This would be rare, since as we mentioned, you'd only have
settings in <code>config.local</code> if you expressly called for them with the <code>--local</code>
flag.</p>
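<p>To make the relationship between the two files concrete, here is a small Python sketch of how settings in <code>config.local</code> take precedence over those in <code>config</code>. It is an illustration only: DVC does its own config parsing, and we use the standard library's <code>configparser</code> here as a stand-in. The file contents, bucket names, and key value are hypothetical, as is the helper name <code>effective_config</code>.</p>

```python
from configparser import ConfigParser

# Hypothetical contents of .dvc/config (committed to Git).
SHARED_CONFIG = """
[core]
remote = storage
['remote "storage"']
url = s3://example-bucket/dvc-store
"""

# Hypothetical contents of .dvc/config.local (ignored by Git).
LOCAL_CONFIG = """
['remote "storage"']
url = s3://my-private-bucket/dvc-store
access_key_id = EXAMPLE_KEY_ID
"""


def effective_config(shared_text, local_text):
    """Merge two config texts; values from local_text override shared_text."""
    parser = ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(shared_text)
    parser.read_string(local_text)  # the later read wins on conflicting keys
    return {name: dict(parser[name]) for name in parser.sections()}


if __name__ == "__main__":
    for section, options in effective_config(SHARED_CONFIG, LOCAL_CONFIG).items():
        print(section, options)
```

<p>In this sketch the shared default remote survives, while the remote's URL and credentials come from the local file, which is why committing <code>config</code> without <code>config.local</code> keeps private settings out of Git.</p>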
<h3 id="q-i-have-an-azure-storage-account-container-and-the-only-link-i-can-see-in-my-azure-portal-for-the-container-is-an-http-link-but-the-tutorial-on-dvc-shows-azure-storage-accessed-with-the-azure-protocol-which-is-right" style="position:relative;">Q: I have an Azure storage account container, and the only link I can see in my Azure portal for the container is an <code>http://</code> link. But the tutorial on DVC shows Azure storage accessed with the <code>azure://</code> protocol. <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/675087897661276169" target="_blank" rel="nofollow noopener noreferrer">Which is right?</a><a href="#q-i-have-an-azure-storage-account-container-and-the-only-link-i-can-see-in-my-azure-portal-for-the-container-is-an-http-link-but-the-tutorial-on-dvc-shows-azure-storage-accessed-with-the-azure-protocol-which-is-right" aria-label="q i have an azure storage account container and the only link i can see in my azure portal for the container is an http link but the tutorial on dvc shows azure storage accessed with the azure protocol which is right permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>What you're describing is exactly as it should be. <code>azure://</code> is an internal URL
protocol that tells DVC which API to use to connect to your remote storage, not
the exact address of your Blob. You can use the format
<code>azure://&lt;container-name&gt;/&lt;optional-path&gt;</code>. For more details, you can refer to
our documentation about
<a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">supported storage types</a>.</p>
<h3 id="q-im-using-dvc-to-version-my-data-with-google-drive-storage-if-i-want-a-developer-to-be-able-to-download-the-data-can-i-give-them-my-gdrive_client_id-and-gdrive_client_secret-or-maybe-give-them-permission-to-access-my-google-drive-folder" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/667198775361536019" target="_blank" rel="nofollow noopener noreferrer">I'm using DVC to version my data with Google Drive storage.</a> If I want a developer to be able to download the data, can I give them my <code>gdrive_client_id</code> and <code>gdrive_client_secret</code>, or maybe give them permission to access my Google Drive folder?<a href="#q-im-using-dvc-to-version-my-data-with-google-drive-storage-if-i-want-a-developer-to-be-able-to-download-the-data-can-i-give-them-my-gdrive_client_id-and-gdrive_client_secret-or-maybe-give-them-permission-to-access-my-google-drive-folder" aria-label="q im using dvc to version my data with google drive storage if i want a developer to be able to download the data can i give them my gdrive_client_id and gdrive_client_secret or maybe give them permission to access my google drive folder permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>For Google Drive, <code>gdrive_client_id</code> and <code>gdrive_client_secret</code> aren't used to
access a specific user's Google Drive disk; they're predominantly used by
Google's API to
<a href="https://rclone.org/drive/#making-your-own-client-id" target="_blank" rel="nofollow noopener noreferrer">track usage and set appropriate rate limits</a>.
So the risk in sharing them is not that your personal files will be vulnerable,
but that your API usage limits could be negatively affected if others are using
it with your credentials. Whether this risk is acceptable is up to you. It's not
unusual for teams and organizations to share a set of credentials, so a
reasonable level of security may mean ensuring that the <code>config</code> file for your
project (which typically contains Google Drive credentials) is only visible to
team members.</p>
<p>Please check out our
<a href="https://dvc.org/doc/user-guide/setup-google-drive-remote" target="_blank" rel="nofollow noopener noreferrer">docs about Google Drive</a>,
too, for more about how DVC uses the Google Drive API.</p>
<h3 id="q-i-just-tried-to-upgrade-dvc-via-homebrew-and-got-a-sha256-mismatch-error-whats-going-on" style="position:relative;">Q: I just tried to upgrade DVC via <code>homebrew</code> and got a "SHA256 mismatch" error. <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/672930535261339669" target="_blank" rel="nofollow noopener noreferrer">What's going on</a>?<a href="#q-i-just-tried-to-upgrade-dvc-via-homebrew-and-got-a-sha256-mismatch-error-whats-going-on" aria-label="q i just tried to upgrade dvc via homebrew and got a sha256 mismatch error whats going on permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>What most likely happened is that you first installed DVC via
<code>brew install iterative/homebrew-dvc/dvc</code>, which is no longer supported—because
DVC is now a core Homebrew formula! Please uninstall and reinstall using
<code>brew install dvc</code> for uninterrupted upgrades in the future.</p>
<h3 id="q-i-still-cant-convince-myself-to-version-control-the-data-rather-than-meta-data-can-anyone-give-me-a-strong-argument-against-version-controlling-data-file-paths-in-config-files-instead-of-using-dvc" style="position:relative;">Q: <a href="https://www.reddit.com/r/datascience/comments/aqkg59/does_anyone_use_data_version_control_dvc_thoughts/eq62lkt?utm_source=share&utm_medium=web2x" target="_blank" rel="nofollow noopener noreferrer">I still can't convince myself to version-control the data rather than meta-data.</a> Can anyone give me a strong argument against version controlling data file paths in config files instead of using DVC?<a href="#q-i-still-cant-convince-myself-to-version-control-the-data-rather-than-meta-data-can-anyone-give-me-a-strong-argument-against-version-controlling-data-file-paths-in-config-files-instead-of-using-dvc" aria-label="q i still cant convince myself to version control the data rather than meta data can anyone give me a strong argument against version controlling data file paths in config files instead of using dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><em>This question is from a <a href="https://bit.ly/38HOEcj" target="_blank" rel="nofollow noopener noreferrer">Reddit discussion.</a></em></p>
<p>Versioning the meta-data associated with your dataset is certainly a workable
strategy. You can use prefixes and suffixes to distinguish models trained on
different versions of data, and keep your data files in one <code>.gitignored</code>
directory. That may be enough for some projects. In our experience, though,
we've found this comes with a host of complications that don't scale well:</p>
<ol>
<li>You'll have to write custom code to support this configuration, specifying
filepaths to your dataset with hardcoded links.</li>
<li>For files that are outputs of your analysis pipeline, you'll need to agree on
conventions for suffixes/prefixes for naming to specify which version of the
dataset was used.</li>
<li>Depending on the meta-data you use to version data files, you may not detect
changes made by users. Even if you can tell a change has occurred, you may
not be able to track <em>who</em> did it <em>when</em>.</li>
</ol>
<p>We designed DVC to optimize data management from the user's perspective: users
can change the dataset version without changing their code, so organizations
don't have to adhere to explicit filenaming conventions and hardcoded links that
are prone to human error. Furthermore, versioning data similar to how Git
versions code provides a largely immutable record of every change that has
occurred. We think this is important as teams and projects grow in complexity.
And from a systems-level perspective, DVC does more than track data: it
deduplicates files behind the scenes, provides simple interfaces for sharing
datasets (and models!) with collaborators and users, and connects specific model
files with the dataset versions they were trained on.</p>
<p>To summarize, DVC is not the only way to version your data. But we think it's
one way to reduce the overhead of managing data infrastructure when your project
involves experimentation or collaboration.</p>https://dvc.org/blog/a-public-reddit-datasethttps://dvc.org/blog/a-public-reddit-datasetMon, 17 Feb 2020 00:00:00 GMT<p>In data science, we frequently deal with classification problems like, <em>is this
<a href="https://www.ics.uci.edu/~vpsaini/" target="_blank" rel="nofollow noopener noreferrer">Yelp reviewer unhappy</a> with their brunch? Is
<a href="https://archive.ics.uci.edu/ml/datasets/spambase" target="_blank" rel="nofollow noopener noreferrer">this email</a> begging me to
claim my long-lost inheritance spam? Does this
<a href="http://ai.stanford.edu/~amaas/data/sentiment/" target="_blank" rel="nofollow noopener noreferrer">movie critic</a> have a positive
opinion of Cats?</em></p>
<p>Perhaps we should also consider the fundamental introspective matter of, <em>am I
maybe being a bit of an asshole?</em></p>
<p>I want to share a dataset of collected moral dilemmas shared on Reddit, as well
as the judgments handed down by a jury of Redditors. The wellspring of this data
is the <a href="https://www.reddit.com/r/AmItheAsshole/" target="_blank" rel="nofollow noopener noreferrer">r/AmItheAsshole</a> subreddit, one
of the natural wonders of the digital world. In this article, I'll show you
what's in the dataset, how to get it, and some things you can do to move the
frontiers of Asshole research forward.</p>
<h2 id="what-makes-an-asshole" style="position:relative;">What makes an Asshole?<a href="#what-makes-an-asshole" aria-label="what makes an asshole permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>r/AmItheAsshole is a semi-structured online forum that’s the internet’s closest
approximation of a judicial system. In this corner of the web, citizens post
situations from their lives and Redditors vote to decide if the writer has acted
as The Asshole or not. For example:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b140849336e06dc98fe3b111add8224e/39600/aita_sample.png" alt="aita sample" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Without bringing any code into the picture, it’s intuitive to think of each new
post as a classification task for the subreddit. Formally, we could think of the
subreddit as executing a function <em>f</em> such that</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 500px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9b43c96909c892a85245ab99f863766e/39600/aita_formula.png" alt="aita formula" title="aita formula" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Of course, finding <em>f</em> won’t be trivial. To be frank, I’m not positive how well we
could hope to forecast the rulings of the subreddit. A lot of posts are not easy
for me to decide, like this one:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 680px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/42054fa4bae8f5578a79e5da28bd5181/39600/aita_llama.png" alt="aita llama" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>There are also many times I find myself disagreeing with the subreddit’s
verdict. All this is to say, I don’t think it’s obvious how well a given human
would do on the task of predicting whether Redditors find someone an Asshole.
Nor is it clear how well we could ever hope for a machine to do approximating
their judgment.</p>
<p>It seems fun to try, though. It helps that the data is plentiful: because the
subreddit is popular and well-moderated, there’s an especially strong volume of
high-quality content (i.e., on-topic and appropriately formatted) being posted
daily.</p>
<h2 id="building-the-dataset" style="position:relative;">Building the dataset<a href="#building-the-dataset" aria-label="building the dataset permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>I pulled content from r/AmItheAsshole dating from the first post in 2012 to
January 1, 2020 using the <a href="https://pushshift.io/" target="_blank" rel="nofollow noopener noreferrer">pushshift.io</a> API to get post
ids and
<a href="https://www.reddit.com/wiki/faq#wiki_how_is_a_submission.27s_score_determined.3F" target="_blank" rel="nofollow noopener noreferrer">scores</a>,
followed by Reddit’s API (<a href="https://praw.readthedocs.io/en/latest/" target="_blank" rel="nofollow noopener noreferrer">praw</a>) to get
post content and meta-data. Using a
<a href="https://openai.com/blog/better-language-models/" target="_blank" rel="nofollow noopener noreferrer">similar standard as OpenAI</a>
for trawling Reddit, I collected text from posts with scores of 3 or more only
for quality control. This cut the number of posts from ~355K to ~111K. Each data
point contains an official id code, timestamp, post title, post text, verdict,
score, and comment count; usernames are not included. The scraping and cleaning
code is available
<a href="https://github.com/iterative/aita_dataset" target="_blank" rel="nofollow noopener noreferrer">in the project GitHub repo</a>. For
simplicity on the first iteration of this problem, I didn’t scrape post
comments, which can number in the thousands for popular posts. But, should
sufficient interest arise, I’d consider adding them to the dataset in some form.</p>
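<p>The score cutoff described above can be sketched in a few lines of Python. This is only an illustration, not the code from the project repo: the record layout and the <code>keep_quality_posts</code> helper are hypothetical, and only the threshold of 3 comes from the text.</p>

```python
# Hypothetical post records; the field names are assumptions for
# illustration, not the dataset's actual schema.
MIN_SCORE = 3  # threshold stated in the article


def keep_quality_posts(posts, min_score=MIN_SCORE):
    """Return only the posts whose score is at least min_score."""
    return [post for post in posts if post["score"] >= min_score]


sample_posts = [
    {"id": "a1", "score": 1},
    {"id": "b2", "score": 3},
    {"id": "c3", "score": 250},
]
kept = keep_quality_posts(sample_posts)
print([post["id"] for post in kept])  # prints ['b2', 'c3']
```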
<p>To focus on the task of classifying posts, I did some light cleaning: I removed
posts in which the body of the text was redacted (surprisingly common) or blank,
and attempted to remove edits where the author had clearly given away the
verdict (e.g., an edit that says, “Update: You’re right, I was the asshole.”).
There were also verdicts that only occurred once (“cheap asshole”, “Crouching
Liar; hidden asshole”, “the pizza is the asshole”), so I restricted the dataset
to posts with standard verdicts. This left ~63K points. Below is a sample of the
resulting dataframe:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4a12da0214084c297acabff7878e4852/39600/df_sample.png" alt="df sample" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
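<p>As a rough sketch of two of the cleaning rules (dropping blank or redacted bodies, and keeping only standard verdicts), assuming posts are plain dictionaries. The field names, the <code>[removed]</code>/<code>[deleted]</code> markers, and the lowercase verdict strings are assumptions for illustration; the actual cleaning code lives in the project repo and also handles verdict-revealing edits, which this sketch does not.</p>

```python
# Assumed verdict strings and redaction markers; check them against the
# real data before relying on this filter.
STANDARD_VERDICTS = {
    "asshole",
    "not the asshole",
    "everyone sucks",
    "no assholes here",
}
REDACTED_MARKERS = {"[removed]", "[deleted]"}


def is_clean(post):
    """Keep a post only if it has a usable body and a standard verdict."""
    body = post["body"].strip()
    if not body or body.lower() in REDACTED_MARKERS:
        return False
    return post["verdict"].strip().lower() in STANDARD_VERDICTS


sample_posts = [
    {"id": "a1", "body": "I ate my roommate's leftovers.", "verdict": "Asshole"},
    {"id": "b2", "body": "[removed]", "verdict": "Not the Asshole"},
    {"id": "c3", "body": "", "verdict": "Asshole"},
    {"id": "d4", "body": "We argued over dinner.", "verdict": "the pizza is the asshole"},
]
clean_ids = [post["id"] for post in sample_posts if is_clean(post)]
print(clean_ids)  # prints ['a1']
```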
<p>The dataset is a snapshot of the subreddit in its current state, but the
subreddit is certain to change over time as new content gets added. In the
interest of having the most comprehensive dataset about being an asshole ever
collected, <em>I’m planning to update this dataset monthly with new posts.</em></p>
<h2 id="how-to-get-the-dataset" style="position:relative;">How to get the dataset<a href="#how-to-get-the-dataset" aria-label="how to get the dataset permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Since this dataset will be updated regularly, we’re using git and DVC to
package, version, and release it. The data itself is stored in an S3 bucket, and
you can use DVC to import the data to your workspace. If you haven't already,
you'll need to <a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">install DVC</a>; one of the simplest
ways is <code>pip install dvc</code>.</p>
<p>Say you have a directory on your local machine where you plan to build some
analysis scripts. Simply run</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> https://github.com/iterative/aita_dataset <span class="token punctuation">\</span>
aita_clean.csv</span></code></pre></div>
<p>This will download a .csv dataset into your local directory, corresponding to
the cleaned version. If you wanted the raw dataset, you would substitute
<code>aita_raw.csv</code> for <code>aita_clean.csv</code>.</p>
<p>Because the dataset is >100 MB, I’ve created a git branch (called “lightweight”)
with 10,000 randomly sampled (cleaned) data points for quick-and-dirty
experimentation that won’t occupy all your laptop’s memory. To download only
this smaller dataset, run</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> <span class="token parameter variable">--rev</span> lightweight <span class="token punctuation">\</span>
https://github.com/iterative/aita_dataset <span class="token punctuation">\</span>
aita_clean.csv</span></code></pre></div>
<h2 id="a-quick-look-at-the-data" style="position:relative;">A quick look at the data<a href="#a-quick-look-at-the-data" aria-label="a quick look at the data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Let’s take a flyover look at the dataset so far. The code to make the following
visuals and results is
<a href="https://github.com/andronovhopf/aita_viz_and_classify" target="_blank" rel="nofollow noopener noreferrer">available on GitHub</a>.
First, here’s a frequency plot for how common different verdicts are on the
subreddit. In addition to “Asshole” and “Not the Asshole”, there are two
additional rulings: “Everyone Sucks” and “No Assholes Here”.</p>
<p><img src="https://dvc.org/2020-02-17/freq_plot-88ad442f08b13a57408f75dd5b54bd63.svg" alt=""></p>
<p>In general agreement with an
<a href="http://www.nathancunn.com/2019-04-04-am-i-the-asshole/" target="_blank" rel="nofollow noopener noreferrer">analysis by Nathan Cunn</a>,
the majority of posts are deemed “Not the Asshole” or “No Assholes Here”. If you
are posting on r/AmItheAsshole, you are probably not the asshole.</p>
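<p>If you want to reproduce the verdict tally yourself, a minimal sketch with the standard library could look like the following. The <code>verdict</code> column name and the sample rows are assumptions made up for illustration; check the header of your downloaded CSV first.</p>

```python
import csv
import io
from collections import Counter


def verdict_counts(csv_file, column="verdict"):
    """Count how often each verdict appears in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Tiny made-up sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "id,verdict\n"
    "a1,not the asshole\n"
    "b2,asshole\n"
    "c3,not the asshole\n"
    "d4,no assholes here\n"
)
counts = verdict_counts(sample_csv)
print(counts.most_common(1))  # prints [('not the asshole', 2)]
```

<p>With the real data, you would pass a file opened with <code>open("aita_clean.csv", newline="", encoding="utf-8")</code> instead of the in-memory sample.</p>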
<p>Next, I attempted a very basic classifier, logistic regression using 1-gram
frequencies (i.e., the frequency of word occurrences in post titles and bodies)
as features. This is intended to give a baseline for what kind of performance
any future modeling efforts should beat. Because of the strong class imbalance,
I used
<a href="https://imbalanced-learn.org/stable/over_sampling.html#smote-variants" target="_blank" rel="nofollow noopener noreferrer">SMOTE to oversample</a>
Asshole posts. And, for simplicity, I binarized the category labels:</p>
<table><thead><tr><th align="center">Verdict</th><th align="center">Label</th></tr></thead><tbody><tr><td align="center">Asshole</td><td align="center">1</td></tr><tr><td align="center">Everyone Sucks</td><td align="center">1</td></tr><tr><td align="center">Not the Asshole</td><td align="center">0</td></tr><tr><td align="center">No Assholes Here</td><td align="center">0</td></tr></tbody></table>
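<p>To make the setup concrete, here is a minimal sketch of the label mapping from the table and of 1-gram counting over a post's title and body. The tokenizer and the example post are assumptions for illustration; the original analysis code (linked above) may tokenize and normalize differently.</p>

```python
import re
from collections import Counter

# Binary labels, copied from the table above.
LABELS = {
    "Asshole": 1,
    "Everyone Sucks": 1,
    "Not the Asshole": 0,
    "No Assholes Here": 0,
}


def unigram_counts(title, body):
    """Count lowercase word tokens across a post's title and body."""
    tokens = re.findall(r"[a-z']+", f"{title} {body}".lower())
    return Counter(tokens)


# Made-up example post.
features = unigram_counts(
    "AITA for eating the pizza?",
    "My mom said the pizza was hers.",
)
label = LABELS["Everyone Sucks"]
print(features["pizza"], features["the"], label)  # prints 2 2 1
```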
<p>With 5-fold cross-validation, this classifier performed above-chance but
modestly: accuracy was 62.0% +/- 0.005 (95% confidence interval). Curiously, the
only other classifier attempt I could find online
<a href="https://github.com/amr-amr/am-i-the-asshole" target="_blank" rel="nofollow noopener noreferrer">reported 61% accuracy on held-out data</a>
using the much more powerful BERT architecture. Considering that logistic
regression has zero hidden layers, and our features discard sequential
information entirely, we’re doing quite well! Although I can’t be certain, I’m
curious how much the discrepancy comes down to dataset size: the previous effort
with BERT appears to be trained on ~30K posts.</p>
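<p>For readers new to cross-validation, the following standard-library sketch shows how 5-fold splits partition a dataset: each post lands in exactly one held-out fold, and the model is trained on the other four. The <code>k_fold_indices</code> helper is hypothetical and is not the evaluation code behind the numbers above.</p>

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [
            i for j, fold in enumerate(folds) if j != held_out for i in fold
        ]
        yield train_idx, test_idx


splits = list(k_fold_indices(10, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))  # prints "8 2" five times
```

<p>The reported accuracy would then be the mean of the per-fold accuracies, with the spread across folds giving a sense of its uncertainty.</p>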
<p>Seeing that logistic regression on word counts doesn’t produce total garbage, I
looked at which words were predictive of class using the
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html" target="_blank" rel="nofollow noopener noreferrer">chi-squared test</a>.
The top five informative words were mom, wife, mother, edit, and dad (looks like
Assholes go back to edit their posts). Since familial relationships featured
prominently, I
<a href="https://www.tidytextmining.com/twitter.html#comparing-word-usage" target="_blank" rel="nofollow noopener noreferrer">estimated the log odds ratio</a>
of being voted Asshole (versus Not the Asshole) if your post mentions a mom,
dad, girlfriend/wife or boyfriend/husband. Roughly, the log odds ratio
measures how much more (or less) likely a keyword is to occur in Asshole posts
than in Not-Asshole posts.</p>
<p><img src="https://dvc.org/2020-02-17/svg_kw2-b03427e1361dcff2e52b80a34525e4a2.svg" alt=""></p>
<p>For reference, the log odds ratios are computed with base 2; a score of 1 means
that Asshole posts are twice as likely to contain the keyword as Not the Asshole
posts. So keep in mind that the effect sizes we’re detecting, although almost
certainly non-zero, are still fairly small.</p>
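<p>As a worked example of that interpretation, the sketch below computes a base-2 log ratio from keyword counts. The counts are invented, and the formula is a simplified reading of the description above (a ratio of the shares of posts containing the keyword); the linked tutorial's exact smoothing is not reproduced here.</p>

```python
import math


def log2_ratio(n_kw_asshole, n_asshole, n_kw_not, n_not):
    """Base-2 log of how much likelier a keyword is in Asshole posts.

    Assumes all four counts are positive; a real analysis would need
    smoothing to handle keywords that never occur in one class.
    """
    p_asshole = n_kw_asshole / n_asshole  # share of Asshole posts with keyword
    p_not = n_kw_not / n_not              # share of Not-Asshole posts with keyword
    return math.log2(p_asshole / p_not)


# Invented counts: the keyword appears in 40 of 100 Asshole posts
# and in 20 of 100 Not-the-Asshole posts, so it is twice as likely.
print(log2_ratio(40, 100, 20, 100))  # prints 1.0
```

<p>A negative value would mean the keyword is more common in Not-the-Asshole posts; for instance, swapping the two groups in the example gives -1.0.</p>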
<p>There seems to be a slight anti-parent trend, with Redditors being more likely
to absolve authors who mention a mom or dad. Only mentioning a female romantic
partner (wife/girlfriend) was associated with a greater likelihood of being
voted the Asshole. This surprised me. My unsubstantiated guess about the gender
difference in mentioning romantic partners is that women may be particularly
likely to question themselves when they act assertively in a relationship. If
this were the case, we might find an especially high proportion of
uncontroversial “Not the Asshole” posts from heterosexual women asking about
situations with their male partners.</p>
<h2 id="how-to-get-more-data" style="position:relative;">How to get more data<a href="#how-to-get-more-data" aria-label="how to get more data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>As I said earlier, the plan is to grow the dataset over time. I’ve just run a
new scrape for posts from January 1-31, 2020 and am adding them to the public
dataset now. To check for a new release, you can re-run the <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a> command
you used to grab the dataset.</p>
<p>If you’re serious about taking on a project such as, say, building a classifier
that beats our state-of-the-art, word-count-based logistic regression model,
I’d like to recommend a better way to integrate the dataset into your workflow:
<a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a>. <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> is like <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a>, but it preserves a link to the
hosted data set. This is desirable if you might iterate through several
experiments in the search for the right architecture, for example, or think
you’ll want to re-train a model. To get the dataset the first time, you’ll run:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git init</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc init</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc import</span> https://github.com/iterative/aita_dataset <span class="token punctuation">\</span>
aita_clean.csv</span></code></pre></div>
<p>Then, because the dataset in your workspace is linked to our dataset repository,
you can update it by simply running:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc update</span> aita_clean.csv</span></code></pre></div>
<p>An additional benefit of codifying the link between your copy of the dataset and
ours is that you can track the form of the dataset you used at different points
in your project development. You can jump back and forth through the project
history then, not only to previous versions of code but also to versions of
(specifically, links to) data. For example, you could roll back the state of the
project to before you updated the dataset and re-run your classifier:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> log <span class="token parameter variable">--oneline</span>
</span>58e28a5 retrain logistic reg
6a44161 update aita dataset
0de4fc3 try logistic regression classifier
a266f15 get aita dataset
55031b0 first commit
<span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> 0de4fc3
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc checkout</span>
</span><span class="token line"><span class="token input">$ </span><span class="token command">python</span> train_classifier.py</span></code></pre></div>
<p>Oh, and one more note: you can always use <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a> and <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> to grab an
older version of the dataset using the tags associated with each release. The
current release is v.20.1 and the original release is v.20.0; the numeric codes
correspond to the year and month.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> <span class="token parameter variable">--rev</span> v.20.0 <span class="token punctuation">\</span>
https://github.com/iterative/aita_dataset aita_clean.csv</span></code></pre></div>
<h2 id="whats-next" style="position:relative;">What’s next<a href="#whats-next" aria-label="whats next permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>I hope that sharing this evolving dataset invites some curiosity, because a lot
of questions come to mind:</p>
<ol>
<li>Can you beat our classifier that predicts how the subreddit will rule?</li>
<li>Is verdict even the most interesting outcome to predict? For example,
developer Scott Ratigan
<a href="https://github.com/scotteratigan/amitheahole" target="_blank" rel="nofollow noopener noreferrer">created a tool to estimate weighted scores</a>
for each post based on the comments (e.g., 75% Asshole, 25% Not the Asshole).
What metrics might invite deeper questions?</li>
<li>Can you identify sentences or phrases that are most informative about the
verdict Redditors reach?</li>
<li>Do voting patterns systematically differ by topic of discussion?</li>
<li>How reliable are verdicts? When a very similar situation is posted multiple
times, do Redditors usually vote the same way?</li>
<li>Is the subreddit’s posting and voting behavior changing over time?</li>
<li>Can you formulate any testable hypotheses based on
<a href="https://www.reddit.com/r/AmItheAsshole/comments/dcae07/2019_subscriber_survey_data_dump/?" target="_blank" rel="nofollow noopener noreferrer">this survey of the subreddit’s demographics</a>?</li>
<li>How often do non-Redditors agree with the subreddit? Under what circumstances
might they tend to disagree?</li>
</ol>
<p>I expect that leaning into the particulars of the dataset (thinking about how
the format influences the content, and how a subreddit might select for
participants that don’t fully represent the population at large) will lead to
more interesting questions than, say, aiming to forecast something about
morality in general. To put it another way, the data’s not unbiased, so maybe
try to learn something about those biases.</p>
<p>If you make something with this dataset, please share. Perhaps we can form an
international Asshole research collective, or at least keep each other apprised
of findings. And of course, reach out if you encounter any difficulties or
probable errors (you can file issues
<a href="https://github.com/iterative/aita_dataset" target="_blank" rel="nofollow noopener noreferrer">on the GitHub repo</a>)!</p>
<p>Lastly, please stay tuned for more releases: there are hundreds of new posts
every day. The biggest asshole may still be out there.</p>
<hr>
<h3 id="more-resources" style="position:relative;">More resources<a href="#more-resources" aria-label="more resources permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You may want to check out a few more efforts to get at r/AmItheAsshole from a
data-scientific perspective, including
<a href="https://medium.com/@tom.gonda/what-does-reddit-argue-about-28432b11ea26" target="_blank" rel="nofollow noopener noreferrer">topic modeling</a>,
<a href="http://www.nathancunn.com/2019-04-04-am-i-the-asshole/" target="_blank" rel="nofollow noopener noreferrer">visualizing voting patterns</a>
and
<a href="https://twitter.com/felipehoffa/status/1223278090958209025" target="_blank" rel="nofollow noopener noreferrer">growth of the subreddit</a>,
and
<a href="https://www.informatik.hu-berlin.de/de/forschung/gebiete/wbi/teaching/studienDiplomArbeiten/finished/2019/expose_fletcher.pdf" target="_blank" rel="nofollow noopener noreferrer">classification</a>
with <a href="https://github.com/amr-amr/am-i-the-asshole" target="_blank" rel="nofollow noopener noreferrer">deep learning</a>. With a
dataset this rich, there’s much more to be investigated, including continuing to
refine these existing methods. And there’s almost certainly room to push the
state of the art in asshole detection!</p>
<p>If you're interested in learning more about using Reddit data, check out
<a href="https://pushshift.io/" target="_blank" rel="nofollow noopener noreferrer">pushshift.io</a>, a database that contains basically all of
Reddit's content. So why make this dataset? I wanted to remove some of the
barriers to analyzing text from r/AmItheAsshole by providing an
already-processed and cleaned version of the data that can be downloaded with a
line of code; pushshift takes some work. You might use pushshift's API and/or
PRAW to augment this dataset in some way, perhaps to compare activity in this
subreddit with another, or broader patterns on Reddit.</p>https://dvc.org/blog/february-20-dvc-heartbeathttps://dvc.org/blog/february-20-dvc-heartbeatMon, 10 Feb 2020 00:00:00 GMT<p>Welcome to the February Heartbeat! This month's featured image is a DVC pipeline
<a href="https://medium.com/nlp-trend-and-review-en/use-dvc-to-version-control-ml-dl-models-bef61dbfe477" target="_blank" rel="nofollow noopener noreferrer">created by one of our users</a>,
which <em>we</em> think resembles a valentine. Here are some more highlights from our
team and our community:</p>
<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><strong>Our team is growing!</strong> In early January, DVC gained two new folks: engineer
<a href="https://github.com/skshetry" target="_blank" rel="nofollow noopener noreferrer">Saugat Pachhai</a> and data scientist
<a href="https://twitter.com/andronovhopf" target="_blank" rel="nofollow noopener noreferrer">Elle O'Brien</a>. Saugat, based in Nepal, will
be contributing to core DVC. Elle (that's me!), currently in San Francisco, will
be leading data science projects and outreach with DVC.</p>
<p>We're <strong>gearing up for a spring full of talks</strong> about DVC projects, including
new up-and-coming features for data cataloging and continuous integration. Here
are just a few events that have been added to our schedule:</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.mlprague.com/#schedule-saturday" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Machine Learning Prague - March 19</h4>
<div class="elp-description">DVC engineer Pawel Redzynski will talk about open source tools for versioning machine learning projects.</div>
<div class="elp-link">mlprague.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-02-10/mlprague-409b825d8df0cec780675a46f056799a.jpg" alt="Machine Learning Prague - March 19">
</div>
</a>
</section>
<p></p>
<p>
</p><section class="elp-content-holder">
<a href="https://divops.org/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">DivOps 2020 - March 24</h4>
<div class="elp-description">Elle O'Brien is talking about open source software in the growing field of MLOps at this international, remote conference.</div>
<div class="elp-link">https://divops.org/</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-02-10/divops_logo-b53c4509a15b5cab656d1c2f21412dfe.png" alt="DivOps 2020 - March 24">
</div>
</a>
</section>
<p></p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.widsconference.org/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Women in Data Science San Diego - May 9</h4>
<div class="elp-description">Elle O'Brien will be delivering a keynote talk about data catalogs and feature stores.</div>
<div class="elp-link">https://www.widsconference.org/</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-02-10/wids-3d1684ad41ad2a2ba1b8f263e88163a7.jpeg" alt="Women in Data Science San Diego - May 9">
</div>
</a>
</section>
<p></p>
<p>Elle O'Brien was recently accepted to give a keynote at
<a href="https://www.widsconference.org/" target="_blank" rel="nofollow noopener noreferrer">Women in Data Science</a> San Diego on May 9. The
talk is called "Packaging data and machine learning models for sharing."</p>
<p>Elle will also be speaking at <a href="https://divops.org/" target="_blank" rel="nofollow noopener noreferrer">Div Ops</a>, a new online
conference about (you guessed it) DevOps, on March 27.</p>
<p>Look out for more conference announcements soon in our <strong>brand new community
page!</strong> We've <a href="https://dvc.org/community" target="_blank" rel="nofollow noopener noreferrer">just launched a new hub</a> for sharing
events, goings-on, and ways to contribute to DVC.</p>
<h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Our users continue to put awesome things on the internet. Like this AI blogger
who isn't afraid to wear his heart on his sleeve.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/@matlihan/my-favorite-data-science-tool-is-dvc-data-version-control-e6ab8aed24d2" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">My favorite data science tool is DVC - Data Version Control</h4>
<div class="elp-description">by Musa Atlıhan</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-02-10/musa_atlihan-a2ebdc84c073c368fb8ed093d3576db0.jpeg" alt="My favorite data science tool is DVC - Data Version Control">
</div>
</a>
</section>
<p></p>
<p>Musa Atlıhan writes:</p>
<blockquote>
<p>From my experience, whether it is a real-world data science project or it is a
data science competition, there are two major key components for success.
Those components are API simplicity and reproducible pipelines. Since data
science means experimenting a lot in a limited time frame, first, we need
machine learning tools with simplicity and second, we need
reliable/reproducible machine learning pipelines. Thanks to tools like Keras,
LightGBM, and fastai we already have simple yet powerful tools for rapid model
development. And thanks to DVC, we are building large projects with
reproducible pipelines very easily.</p>
</blockquote>
<p>It's cool how Musa puts DVC in context with libraries for model building. In a
way, the libraries that have made it easier than ever to iterate through
different model architectures have increased the need for reproducibility in
proportion.</p>
<p>Meanwhile in Germany, superusers Marcel Mikl and Bert Besser wrote
<a href="https://blog.codecentric.de/en/2019/03/walkthrough-dvc/" target="_blank" rel="nofollow noopener noreferrer">another</a> seriously
comprehensive article about DVC for Codecentric. Marcel and Bert walk readers
through the steps to <strong>build a custom machine learning training pipeline with
remote computing resources</strong> like GCP and AWS. It's an excellent guide to
configuring model training with attention to <em>automation</em> and <em>collaboration</em>.
We give them 🦉🦉🦉🦉🦉 out of 5.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://blog.codecentric.de/en/2020/01/remote-training-gitlab-ci-dvc/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Remote training with GitLab-CI and DVC</h4>
<div class="elp-description">by Marcel Mikl and Bert Besser</div>
<div class="elp-link">blog.codecentric.de</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-02-10/marcel-176e6cfec67a3f909f1f2c8b58615383.png" alt="Remote training with GitLab-CI and DVC">
</div>
</a>
</section>
<p></p>
<p>Here are a few more stories on our radar:</p>
<ul>
<li>
<p><strong>AI Singapore shares their method for AI development and deployment.</strong> This
<a href="https://makerspace.aisingapore.org/2020/01/agile-ai-engineering-in-aisg/" target="_blank" rel="nofollow noopener noreferrer">blog about how Agile informs their processes</a>
for continuous integration and delivery includes data versioning.</p>
</li>
<li>
<p><strong>Toucan AI dispenses advice for ML engineers.</strong> This
<a href="https://toucanai.com/blog/post/building-production-ml/" target="_blank" rel="nofollow noopener noreferrer">blog for practitioners</a>
discusses questions like, "When to work on ML vs. the processes that surround
ML". It covers how DVC is used for model versioning in the exploration stage
of ML.</p>
</li>
<li>
<p><strong>DVC at the University.</strong> A recent
<a href="https://arxiv.org/pdf/1912.01706.pdf" target="_blank" rel="nofollow noopener noreferrer">pre-print from natural language processing researchers at Université Laval</a>
explains how DVC facilitated dataset access for collaborators.</p>
<blockquote>
<p>"In our case, the original dataset takes up to 6 Gigabytes. The previous way
of retrieving the dataset over the network with a standard 20 Mbits/sec
internet connexion took up to an hour to complete (including uncompressing
the data). Using DVC reduced the retrieval time of the dataset to 3 minutes
over the network with the same internet connexion."</p>
</blockquote>
<p>Thanks for sharing; this is a lovely result. Oh, and last…</p>
</li>
<li>
<p><strong>DVC is a job requirement</strong>! We celebrated a small milestone when we stumbled
across a listing for a data engineer to support R&D at
<a href="https://www.elvie.com/en-us/" target="_blank" rel="nofollow noopener noreferrer">Elvie</a>, a maker of tech for women's health
(pretty neat mission). The decorations on the job posting are ours 😎</p>
</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 470px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f0e8a9d4e7525ba2c56504833e14c3cd/39600/elvie.png" alt="elvie" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>A job advertisement featuring DVC.</em></p>https://dvc.org/blog/gsoc-ideas-2020https://dvc.org/blog/gsoc-ideas-2020Tue, 04 Feb 2020 00:00:00 GMT<p>Announcement, announcement! After a successful experience with
<a href="https://developers.google.com/season-of-docs" target="_blank" rel="nofollow noopener noreferrer">Google Season of Docs</a> in 2019,
we're putting out a call for students to apply to work with DVC as part of
<a href="https://summerofcode.withgoogle.com/" target="_blank" rel="nofollow noopener noreferrer">Google Summer of Code</a>. If you want to
make a dent in open source software development with mentorship from our team,
read on.</p>
<h2 id="prerequisites-to-apply" style="position:relative;">Prerequisites to apply<a href="#prerequisites-to-apply" aria-label="prerequisites to apply permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Besides the general requirements to apply to Google Summer of Code, there are a
few skills we look for in applicants.</p>
<ol>
<li><strong>Python experience.</strong> All of our core development is done in Python, so we
prefer candidates that are experienced in Python. However, we will consider
applicants who are very strong in another language and familiar with Python
basics.</li>
<li><strong>Git experience.</strong> Git is also a key part of DVC development, as DVC is
built around Git; that said, for certain projects (rated as “Beginner”) a
surface-level knowledge of Git will be sufficient.</li>
<li><strong>People skills.</strong> Beyond technical fundamentals, we put a high value on
communication skills: the ability to report and document your experiments and
findings, to work kindly with teammates, and to explain your goals and work
clearly.</li>
</ol>
<p>If you like our mission but aren't sure if you're sufficiently prepared, please
get in touch anyway. We'd love to hear from you.</p>
<h2 id="project-ideas" style="position:relative;">Project ideas<a href="#project-ideas" aria-label="project ideas permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Below are several project ideas that are an immediate priority for the core DVC
team. Of course, we welcome students to create their own proposals, even if they
differ from our ideas. Projects will be primarily mentored by co-founders
<a href="https://github.com/dmpetrov" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> and
<a href="https://github.com/shcheklein" target="_blank" rel="nofollow noopener noreferrer">Ivan Shcheklein</a>.</p>
<ol>
<li>
<p><strong>Migrate to the latest v3 API to improve Google Drive support.</strong> Our
organization is a co-maintainer of the PyDrive library in collaboration with
a team at Google. The PyDrive library is now several years old and still
relies on the v2 protocol. We would like to migrate to v3, which we expect
will boost performance for many DVC use cases (e.g., the ability to filter
the fields retrieved from the API). For this project, we’re looking
for a student to work with us to prepare the next major version of the
PyDrive library and to make the changes to the core DVC code needed to
support it. Because PyDrive is broadly used outside of DVC, this project is a
chance to work on a library of widespread interest to the Python community.
<br> <br> <em>Skills required:</em> Python, Git, experience with APIs <br>
<em>Difficulty rating:</em> Beginner-Medium <br></p>
</li>
<li>
<p><strong>Introducing parallelism to DVC.</strong> One of DVC’s features is the ability to
create pipelines, linking data repositories with code to process data, train
models, and evaluate model metrics. Once a DVC pipeline is created, the
pipeline can be shared and re-run in a systematic and entirely reproducible
way. Currently, DVC executes pipelines sequentially, even though some steps
could be run in parallel (such as data preprocessing). We would like to support
parallelization for pipeline steps specified by the user. Furthermore, we’ll
need to add flags to DVC commands that specify the level of
parallelization (CPU, GPU, or memory). <br> <br> <em>Skills required:</em>
Python, Git. Some experience with parallelization and/or scientific computing
would be helpful but not required. <br> <em>Difficulty rating:</em> Advanced
<br></p>
</li>
<li>
<p><strong>Developing use cases for data registries and ML model zoos.</strong> A new DVC
functionality that we’re particularly excited about is <code>summon</code>, a method
that can turn remotely-hosted machine learning artifacts such as datasets,
trained models, and more into objects in the user’s local environment (such
as a Jupyter notebook). This is a foundation for creating data catalogs of
data-frames and machine learning model zoos on top of Git repositories and
cloud storage (like GCS or S3). We need to identify and implement model zoos
(think PyTorch Hub, the Caffe Model Zoo, or the TensorFlow DeepLab Model Zoo)
and data registries for types that are not supported by DVC yet. Currently,
we’ve tested <code>summon</code> with PyTorch image segmentation models and Pandas
dataframes. We’re looking for students to explore other possible use cases.
<br> <br> <em>Skills required:</em> Python, Git, and some machine learning or
data science experience <br> <em>Difficulty rating:</em> Beginner-Medium <br></p>
</li>
<li>
<p><strong>Continuous delivery for JetBrains TeamCity.</strong> Continuous integration and
continuous delivery (CI/CD) for ML projects is an area where we see
<a href="https://martinfowler.com/articles/cd4ml.html" target="_blank" rel="nofollow noopener noreferrer">DVC make a big impact</a>-
specifically, by delivering datasets and ML models into CI/CD pipelines.
While there are many cases when DVC is used inside GitHub Actions and GitLab
CI, you will be transferring this experience to another type of CI/CD system,
<a href="https://www.jetbrains.com/teamcity/" target="_blank" rel="nofollow noopener noreferrer">JetBrains TeamCity</a>. We're working to
integrate DVC's model and dataset versioning into TeamCity's CI/CD toolkit.
This project would be ideal for a student looking to explore the growing
field of MLOps, an offshoot of DevOps with the specifics of ML projects at
the center. <br> <br> <em>Skills required:</em> Python, Git, bash scripting. It
would be nice, but not necessary, to have some experience with CI/CD tools
and developer workflow automation. <br> <em>Difficulty rating:</em>
Medium-Advanced <br></p>
</li>
<li>
<p><strong>DVC performance testing framework.</strong> Performance is a core value of DVC. We
will be creating a performance monitoring and testing framework where new
scenarios (e.g., unit testing) can be added. The framework should reflect
all performance improvements and degradations for each of the DVC releases.
It would be especially compelling if testing could be integrated with our
GitHub workflow (CI/CD). This is a great opportunity for a student to learn
about DVC and versioning in-depth and contribute to its stability. <br>
<br> <em>Skills required:</em> Python, Git, bash scripting. <br> <em>Difficulty
rating:</em> Medium-Advanced <br></p>
</li>
</ol>
<h2 id="if-youd-like-to-apply" style="position:relative;">If you'd like to apply<a href="#if-youd-like-to-apply" aria-label="if youd like to apply permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Please refer to the
<a href="https://summerofcode.withgoogle.com/" target="_blank" rel="nofollow noopener noreferrer">Google Summer of Code</a> application guides
for specifics of the program. Students looking to know more about DVC, and our
worldwide community of contributors, will learn most by visiting our
<a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>,
<a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">GitHub repository</a>, and
<a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">Forum</a>. We are available to discuss project proposals
from interested students and can be reached by <a href="mailto:[email protected]" target="_blank" rel="nofollow noopener noreferrer">email</a>
or on our Discord channel.</p>https://dvc.org/blog/january-20-community-gemshttps://dvc.org/blog/january-20-community-gemsMon, 20 Jan 2020 00:00:00 GMT<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>There's a lot of action in our Discord channel these days. Ruslan, DVC's core
maintainer, said it best with a gif.</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">How it feels when <a href="https://twitter.com/DVCorg">@DVCorg</a> team is handling multiple conversations on Discord at the same time. <a href="https://t.co/QrLusdWYml">https://t.co/QrLusdWYml</a></p>— Ruslan Kuprieiev 🇺🇦 (@rkuprieiev) <a href="https://twitter.com/rkuprieiev/status/1144008869414342658">June 26, 2019</a></blockquote>
<p>It's a lot to keep up with, so here are some highlights. We think these are
useful, good-to-know, and interesting conversations between DVC developers and
users.</p>
<h3 id="q-what-pros-does-dvc-have-compared-to-git-lfs" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/657590900754612284" target="_blank" rel="nofollow noopener noreferrer">What pros does DVC have compared to Git LFS?</a><a href="#q-what-pros-does-dvc-have-compared-to-git-lfs" aria-label="q what pros does dvc have compared to git lfs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>For an in-depth answer, check out this
<a href="https://stackoverflow.com/questions/58541260/difference-between-git-lfs-and-dvc" target="_blank" rel="nofollow noopener noreferrer">Stack Overflow discussion</a>.
But in brief, with DVC you don't need a special server, and you can use nearly
any kind of storage (S3, Google Cloud Storage, Azure Blobs, your own server,
etc.) without a fuss. There are also no limits on the size of the data that you
can store, unlike with GitHub. With Git LFS, there are some general LFS server
limits, too. DVC has additional features for sharing your data (e.g.,
<a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a>) and has pipeline support, so it does much more than LFS. Plus, we
have flexible and quick checkouts, as we utilize different link types (reflinks,
symlinks, and hardlinks). We think there are lots of advantages; of course, the
usefulness will depend on your particular needs.</p>
<h3 id="q-how-do-i-use-dvc-with-ssh-remote-storage-i-usually-connect-with-a-pem-key-file-how-do-i-do-the-same-with-dvc" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/656016145119182849" target="_blank" rel="nofollow noopener noreferrer">How do I use DVC with SSH remote storage?</a> I usually connect with a .pem key file. How do I do the same with DVC?<a href="#q-how-do-i-use-dvc-with-ssh-remote-storage-i-usually-connect-with-a-pem-key-file-how-do-i-do-the-same-with-dvc" aria-label="q how do i use dvc with ssh remote storage i usually connect with a pem key file how do i do the same with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC is built to work with the SSH protocol to access remote storage (we provide
some
<a href="https://dvc.org/doc/user-guide/external-dependencies#ssh" target="_blank" rel="nofollow noopener noreferrer">examples in our official documentation</a>).
When SSH requires a key file, try this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> myremote keyfile <span class="token operator"><</span>path to *.pem<span class="token operator">></span></span></code></pre></div>
<h3 id="q-if-you-train-a-tensorflow-model-that-creates-multiple-checkpoint-files-how-do-you-establish-them-as-dependencies-in-the-dvc-pipeline" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/651098762466426891" target="_blank" rel="nofollow noopener noreferrer">If you train a TensorFlow model that creates multiple checkpoint files, how do you establish them as dependencies in the DVC pipeline?</a><a href="#q-if-you-train-a-tensorflow-model-that-creates-multiple-checkpoint-files-how-do-you-establish-them-as-dependencies-in-the-dvc-pipeline" aria-label="q if you train a tensorflow model that creates multiple checkpoint files how do you establish them as dependencies in the dvc pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You can specify a directory as a dependency/output in your DVC pipeline, and
store checkpointed models in that directory. It might look like this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token punctuation">\</span>
<span class="token parameter variable">-f</span> train.dvc <span class="token punctuation">\</span>
<span class="token parameter variable">-d</span> data <span class="token punctuation">\</span>
<span class="token parameter variable">-d</span> code/train.py <span class="token punctuation">\</span>
<span class="token parameter variable">-o</span> models python code/train.py</span></code></pre></div>
<p>where <code>models</code> is a directory created for checkpoint files. If you would like to
preserve your models in the data directory, though, then you would need to
specify them one by one. You can do this with bash:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token variable"><span class="token variable">$(</span><span class="token keyword">for</span> <span class="token for-or-select variable">file</span> <span class="token keyword">in</span> data/*.gz<span class="token punctuation">;</span> <span class="token keyword">do</span> <span class="token builtin class-name">echo</span> <span class="token parameter variable">-d</span> $file<span class="token punctuation">;</span> <span class="token keyword">done</span><span class="token variable">)</span></span></span></code></pre></div>
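<p>As a minimal sketch of how those per-file flags get built, the snippet below collects one <code>-d</code> flag for every <code>.gz</code> file and prints the resulting command instead of running it, so it works without DVC installed. The temporary directory and the file names <code>part1.gz</code> and <code>part2.gz</code> are made up for illustration.</p>

```shell
# Hypothetical stand-in for the project's data/ directory.
workdir="$(mktemp -d)"
mkdir -p "$workdir/data"
touch "$workdir/data/part1.gz" "$workdir/data/part2.gz"

# Append " -d <file>" once per matching file; the leading space keeps
# each flag separated from the previous file name.
dep_args=""
for file in "$workdir"/data/*.gz; do
  dep_args="$dep_args -d $file"
done

# Print the command that would be run, rather than invoking dvc itself.
echo "dvc run$dep_args"
```

<p>To run the stage for real, drop the <code>echo</code> and leave <code>$dep_args</code> unquoted so the shell splits it into separate arguments. This simple form assumes file names without spaces.</p>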
<p>Be careful, though: if you declare checkpoint files to be an output of the DVC
pipeline, you won’t be able to re-run the pipeline using those checkpoint files
to initialize weights for model training. This would introduce circularity, as
your output would become your input.</p>
<p>Also keep in mind that whenever you re-run a pipeline with <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>, outputs
are deleted and then regenerated. If you don't wish to automatically delete
outputs, there is an <code>--outs-persist</code> option for <code>dvc run</code> (see discussion
<a href="https://github.com/iterative/dvc/issues/1214" target="_blank" rel="nofollow noopener noreferrer">here</a> and
<a href="https://github.com/iterative/dvc/issues/1884" target="_blank" rel="nofollow noopener noreferrer">here</a>), although we don't
currently provide technical support for it.</p>
<p>Finally, remember that setting something as a dependency (<code>-d</code>) doesn't mean it
is automatically tracked by DVC. So remember to <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> data files in the
beginning!</p>
<h3 id="q-is-it-possible-to-use-the-same-cache-directory-for-multiple-dvc-repos-that-are-used-in-parallel-or-do-i-need-external-software-to-prevent-potential-race-conditions" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/655012135973158942" target="_blank" rel="nofollow noopener noreferrer">Is it possible to use the same cache directory for multiple DVC repos that are used in parallel?</a> Or do I need external software to prevent potential race conditions?<a href="#q-is-it-possible-to-use-the-same-cache-directory-for-multiple-dvc-repos-that-are-used-in-parallel-or-do-i-need-external-software-to-prevent-potential-race-conditions" aria-label="q is it possible to use the same cache directory for multiple dvc repos that are used in parallel or do i need external software to prevent potential race conditions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is absolutely possible, and you don't need any external software to safely
use multiple DVC repos in parallel. With DVC, cache operations are atomic. The
only exception is cleaning the cache with <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a>, which you should only run
when no one else is working on a shared project that is referenced in your cache
(and also, be sure to use the <code>--projects</code> flag
<a href="https://dvc.org/doc/command-reference/gc" target="_blank" rel="nofollow noopener noreferrer">as described in our docs</a>). For more
about using multiple DVC repos in parallel, check out some discussions
<a href="https://discuss.dvc.org/t/setup-dvc-to-work-with-shared-data-on-nas-server/180" target="_blank" rel="nofollow noopener noreferrer">here</a>
and
<a href="https://dvc.org/doc/use-cases/fast-data-caching-hub#example-shared-development-server" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
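<p>To give a feel for why concurrent writers need not corrupt a shared cache, here is a minimal Python sketch of the general write-then-rename pattern for a content-addressed store. It is an illustration under stated assumptions, not DVC's actual code: the <code>add_to_cache</code> helper, the SHA-256 hash, and the two-level directory layout are all hypothetical choices made for this example.</p>

```python
import hashlib
import os
import tempfile
from pathlib import Path


def add_to_cache(cache_dir: Path, data: bytes) -> Path:
    """Store data under its content hash using write-then-rename.

    Hypothetical helper for illustration; not DVC's implementation.
    """
    digest = hashlib.sha256(data).hexdigest()
    # Assumed layout: first two hex characters form a subdirectory.
    target = cache_dir / digest[:2] / digest[2:]
    if target.exists():
        # Same content is already cached, so there is nothing to write.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory (same filesystem)...
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent))
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # ...then move it into place in a single step, so readers never
        # observe a partially written cache entry.
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

<p>Because an entry only appears under its final name once it is fully written, two processes that add the same data at the same time both end up pointing at one complete copy.</p>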
<h3 id="q-what-are-some-strategies-for-reproducibility-if-parts-of-our-model-training-pipeline-are-run-on-our-organizationss-hpc" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/652380507832844328" target="_blank" rel="nofollow noopener noreferrer">What are some strategies for reproducibility if parts of our model training pipeline are run on our organization's HPC?</a><a href="#q-what-are-some-strategies-for-reproducibility-if-parts-of-our-model-training-pipeline-are-run-on-our-organizationss-hpc" aria-label="q what are some strategies for reproducibility if parts of our model training pipeline are run on our organizationss hpc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Using DVC for version control is entirely compatible with using remote computing
resources, like high performance computing (HPC), in your model training
pipeline. We think a great example of using DVC with parallel computing is
provided by <a href="http://www.peterfogh.dk/" target="_blank" rel="nofollow noopener noreferrer">Peter Fogh</a>. Take a
<a href="https://github.com/PeterFogh/dvc_dask_use_case" target="_blank" rel="nofollow noopener noreferrer">look at his repo</a> for a
detailed use case. Please keep us posted about how HPC works in your pipeline,
as we'll be eager to pass on any insights to the community.</p>
<h3 id="q-say-i-have-a-git-repository-with-multiple-projets-inside-one-classification-one-object-detection-etc-is-it-possible-to-tell-dvc-to-just-pull-data-for-one-particular-project" style="position:relative;">Q: Say I have a Git repository with multiple projects inside (one classification, one object detection, etc.). <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/646760832616890408" target="_blank" rel="nofollow noopener noreferrer">Is it possible to tell DVC to just pull data for one particular project?</a><a href="#q-say-i-have-a-git-repository-with-multiple-projets-inside-one-classification-one-object-detection-etc-is-it-possible-to-tell-dvc-to-just-pull-data-for-one-particular-project" aria-label="q say i have a git repository with multiple projets inside one classification one object detection etc is it possible to tell dvc to just pull data for one particular project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Absolutely, DVC supports pulling data from different DVC files. An example would
be having two project subdirectories in your Git repo, <code>classification</code> and
<code>detection</code>. You could use <a href="https://dvc.org/doc/command-reference/pull#-R"><code>dvc pull -R classification</code></a> to only pull files in
that project to your workspace.</p>
<p>If you prefer to be even more granular, you can <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> files individually.
Then you can use <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull &lt;filename&gt;.dvc</code></a> to retrieve the outputs specified
only by that file.</p>
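<p>As a rough way to preview which DVC files a recursive pull on a directory would look at, the short Python sketch below lists every <code>.dvc</code> file beneath it. This is only an approximation: the <code>dvc_files_under</code> helper is hypothetical, and it ignores other details that DVC itself handles when resolving targets.</p>

```python
from pathlib import Path
from typing import List


def dvc_files_under(directory: Path) -> List[Path]:
    """Return every .dvc file in `directory` and its subdirectories, sorted.

    Hypothetical helper that approximates the set of files a recursive
    pull on `directory` would inspect.
    """
    return sorted(directory.rglob("*.dvc"))
```

<p>For example, <code>dvc_files_under(Path("classification"))</code> would list the DVC files of the classification project and leave anything under <code>detection</code> out.</p>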
<h3 id="q-is-it-possible-to-set-an-s3-remote-without-the-use-of-aws-credentials-with-dvc-i-want-to-publicly-host-a-dataset-so-that-everybody-who-clones-my-code-repo-can-just-run-dvc-pull-to-fetch-the-dataset" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/623234659098296348" target="_blank" rel="nofollow noopener noreferrer">Is it possible to set an S3 remote without the use of AWS credentials with DVC?</a> I want to publicly host a dataset so that everybody who clones my code repo can just run <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> to fetch the dataset.<a href="#q-is-it-possible-to-set-an-s3-remote-without-the-use-of-aws-credentials-with-dvc-i-want-to-publicly-host-a-dataset-so-that-everybody-who-clones-my-code-repo-can-just-run-dvc-pull-to-fetch-the-dataset" aria-label="q is it possible to set an s3 remote without the use of aws credentials with dvc i want to publicly host a dataset so that everybody who clones my code repo can just run dvc pull to fetch the dataset permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes, and we love the idea of publicly hosting a dataset. There are a few ways to
do it with DVC. We use one method in our own DVC project repository on GitHub.
If you run <code>git clone https://github.com/iterative/dvc</code> and then <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a>,
you’ll see that DVC is downloading data from an HTTP remote, which is
actually just an S3 bucket that we've granted public HTTP read access to.</p>
<p>So you would need to configure two remotes, each pointing to the same S3
bucket through a different protocol. Like this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">-d</span> <span class="token parameter variable">--local</span> myremote s3://bucket/path
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">-d</span> mypublicremote http://s3-external-1.amazonaws.com/bucket/path</span></code></pre></div>
<p>Here's why this works: the <code>-d</code> flag sets the default remote, and the <code>--local</code>
flag writes to a local configuration file whose settings override the
project's shared settings on your machine and won't be shared through Git (you
can read more about this
<a href="https://dvc.org/doc/command-reference/remote/add" target="_blank" rel="nofollow noopener noreferrer">in our docs</a>).</p>
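<p>The override behaviour can be modelled in a few lines of Python. This is a simplified sketch rather than DVC's real configuration loader: the dictionaries stand in for the shared and local config files, the section naming is an assumption, and the remote names and URLs simply mirror the placeholders from the commands above.</p>

```python
from typing import Dict

Config = Dict[str, Dict[str, str]]

# Shared settings, committed to Git: everyone gets the public HTTP remote.
project_config: Config = {
    "core": {"remote": "mypublicremote"},
    'remote "mypublicremote"': {
        "url": "http://s3-external-1.amazonaws.com/bucket/path"
    },
}

# Local settings, never committed: the maintainer's S3 remote.
local_config: Config = {
    "core": {"remote": "myremote"},
    'remote "myremote"': {"url": "s3://bucket/path"},
}


def effective_config(project: Config, local: Config) -> Config:
    """Layer local settings over project settings (local wins on conflict)."""
    merged = {section: dict(values) for section, values in project.items()}
    for section, values in local.items():
        merged.setdefault(section, {}).update(values)
    return merged


def default_remote_url(config: Config) -> str:
    """Look up the URL of whichever remote is marked as the default."""
    name = config["core"]["remote"]
    return config['remote "{}"'.format(name)]["url"]
```

<p>With both layers present, the default resolves to the S3 URL. With only the shared layer, which is all a fresh clone has, it resolves to the public HTTP URL.</p>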
<p>This means that even though you and users from the public are accessing the
stored dataset by different protocols (S3 and HTTP), you'll all run the same
command: <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a>.</p>https://dvc.org/blog/january-20-dvc-heartbeathttps://dvc.org/blog/january-20-dvc-heartbeatFri, 17 Jan 2020 00:00:00 GMT<p>Welcome to the New Year! Time for a recap of the last few weeks of activity in
the DVC community.</p>
<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We were honored to be named a <a href="https://ods.ai/awards/2019/" target="_blank" rel="nofollow noopener noreferrer">Project of the Year</a>
by Open Data Science, Russia's largest community of data scientists and machine
learning practitioners. Check out our ⭐️incredibly shiny trophy⭐️!</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">DVC is the "project of the year" according to @odsai_en!<br>😱🏆🎉<br>Open Data Science the largest DS community we know, with over 40K active members, great courses and it's own conf Data Fest.<br>Many thanks to the organizers and voters!<br>This is the best surprize gift for the team!!🥳 <a href="https://t.co/LZgewjM582">pic.twitter.com/LZgewjM582</a></p>— 🦉DVC (@DVCorg) <a href="https://twitter.com/DVCorg/status/1209544709930016768">December 24, 2019</a></blockquote>
<p>DVC hit <strong>100 individual contributors</strong> on GitHub! To celebrate our
100<sup>th</sup> contributor, <a href="https://github.com/verasativa/" target="_blank" rel="nofollow noopener noreferrer">Vera Sativa</a>, we
sent her $500 to use on any educational opportunity and her own DeeVee (that's
our rainbow owl). We also awarded educational mini-grants to two of DVC's
biggest contributors, <a href="https://github.com/witiko" target="_blank" rel="nofollow noopener noreferrer">Vít Novotný</a> and
<a href="https://twitter.com/david_prihoda" target="_blank" rel="nofollow noopener noreferrer">David Příhoda</a>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 612px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/78b685e283d679c8ebe518ea17520f6d/39600/odd_with_deevee.png" alt="odd with deevee" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Vera (center, flashing a
peace sign) thanked us with this lovely picture of DeeVee and her team,
<a href="https://odd.co" target="_blank" rel="nofollow noopener noreferrer">Odd Industries</a>. They are making some extremely neat tools for
construction teams using computer vision.</em></p>
<p><strong>We were at PyData LA!</strong> Our fearless leader
<a href="https://www.youtube.com/watch?v=7Wsd6V0k4Oc" target="_blank" rel="nofollow noopener noreferrer">Dmitry gave a talk</a> and we set up
a busy booth to meet with the Pythonistas of Los Angeles. It was a cold and
blustery day, but visitors kept showing up to our semi-outdoor booth. We're sure
they came for the open source version control and not the donuts.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 512px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c827a7148f442ec7b39f79659a697878/03346/py_data1.jpg" alt="py data1" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 512px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/76308821da8925b6cf7540b9b0b1ea3f/03346/py_data2.jpg" alt="py data2" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>The DVC team and PyData
volunteers who heroically staffed our booth in the rain.</em></p>
<p>Our engineer and technical writer Jorge reported:</p>
<blockquote>
<p>We were super happy to meet all kinds of data professionals and enthusiasts in
several fields who are learning and adopting DVC with their teams – including
several working with privacy-sensitive medical records, very cool!</p>
</blockquote>
<hr>
<h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Here are some rumblings from the machine learning (ML) and data science
community that got us talking.</p>
<p><strong>A machine learning software wishlist.</strong> Computer scientist and writer
<a href="https://twitter.com/chipro" target="_blank" rel="nofollow noopener noreferrer">Chip Huyen</a> tweeted about her ML software wishlist
and kicked off a big community discussion.</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">I've been thinking about the software stack for machine learning. Tools I'd love to see.<br><br>1. Pip for pretrained models.<br>2. Version control for datasets.<br>3. GPU-friendly CI. Travis CI, Circe CI don't support GPUs. Jenkins is a pain.<br>4. Fast dataframes. Why is Pandas so slow?</p>— Chip Huyen (@chipro) <a href="https://twitter.com/chipro/status/1202815757593108480">December 6, 2019</a></blockquote>
<p>Her tweet resonated with a lot of practitioners, who were eager to discuss the
solutions they'd tried. Among the many thoughtful replies and recommendations,
we were thrilled to see DVC mentioned.</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">We're using <a href="https://twitter.com/DVCorg">@DVCorg</a> for 2) and it works great. 🙂</p>— Kristijan (@kristijan_moves) <a href="https://twitter.com/kristijan_moves/status/1202879739716870144">December 6, 2019</a></blockquote>
<p>If you haven't already, definitely check out Chip's
<a href="https://twitter.com/chipro/status/1202815757593108480" target="_blank" rel="nofollow noopener noreferrer">thread</a>, and follow her
on Twitter for more excellent, accessible content about ML engineering. We're
thinking hard about these ideas and hope the discussion continues on- and
offline.</p>
<p><strong>A gentle intro to DVC for data scientists.</strong> Scientist
<a href="https://twitter.com/andronovhopf" target="_blank" rel="nofollow noopener noreferrer">Elle O'Brien</a> published a code walkthrough
about using DVC to make an image classification project more reproducible.
Specifically, the blog is a case study about version control when a dataset
grows over time. If you're looking for a DVC tutorial geared toward data
scientists, this might be up your alley.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://towardsdatascience.com/start-version-controlling-your-machine-learning-datasets-2b872e109856" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Start Version Controlling your Machine Learning Datasets</h4>
<div class="elp-description">Make your machine learning and data science projects reproducible with open source tools.</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-01-17/medium_1-65307a03bcb030a905958954107696f0.png" alt="Start Version Controlling your Machine Learning Datasets">
</div>
</a>
</section>
<p></p>
<p><strong>Ideas for data scientists to level up their code.</strong> Machine learning engineer
Andrew Greatorex posted a blog called “Down with technical debt! Clean Python
for data scientists.” Andrew highlights something we can easily relate to: the
“science” part of data science, which encourages experimentation and
flexibility, sometimes means less emphasis on readable, shareable code. Andrew
writes:</p>
<blockquote>
<p>“I’m hoping to shed light on some of the ways that more fledgling data
scientists can write cleaner Python code and better structure small scale
projects, with the important side effect of reducing the amount of technical
debt you inadvertently burden on yourself and your team.”</p>
</blockquote>
<p>In this blog, DVC gets a shout-out as Andrew’s preferred data versioning tool,
used in conjunction with Git for versioning Python code. Thanks!</p>
<p>
</p><section class="elp-content-holder">
<a href="https://towardsdatascience.com/down-with-technical-debt-clean-python-for-data-scientists-aa7592eff7fc" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Down with technical debt! Clean Python for data scientists.</h4>
<div class="elp-description"></div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-01-17/medium_2-3ad04f500f8cbbe7108a635482d68baf.png" alt="Down with technical debt! Clean Python for data scientists.">
</div>
</a>
</section>
<p></p>
<p><strong>An introduction to MLOps.</strong> Engineer
<a href="https://twitter.com/elfouly_sharif" target="_blank" rel="nofollow noopener noreferrer">Sharif Elfouly</a> wrote an approachable guide
to thinking about MLOps, the growing field around making ML projects run
efficiently from experimentation to production. He summarises why managing ML
projects can be fundamentally different from traditional software development:</p>
<blockquote>
<p>“The main difference between traditional software and ML is that you don’t
only have the code. You also have data, models, and experiments. Writing
traditional software is relatively straightforward but in ML you need to try
out a lot of different things to find the best and fastest model for your
use-case. You have a lot of different model types to choose from and every
single one of them has its specific hyperparameters. Even if you work alone
this can get out of hand pretty quickly.”</p>
</blockquote>
<p>Sharif gives some recommendations for tools that work especially well for ML,
and he writes that DVC is the “perfect combination for versioning your code and
data.” Thanks, Sharif! We think you’re perfect, too.</p>
<p>
</p><section class="elp-content-holder">
<a href="https://towardsdatascience.com/down-with-technical-debt-clean-python-for-data-scientists-aa7592eff7fc" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">MLOps Done Right</h4>
<div class="elp-description">What is MLOps? Why is it so important? How to do it right!</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2020-01-17/medium_3-225e59dae2ce1f7bef910517f0bd8ae6.png" alt="MLOps Done Right">
</div>
</a>
</section>
<p></p>
<p>That's a wrap for January. We'll see you next month with more updates!</p>https://dvc.org/blog/november-19-dvc-heartbeathttps://dvc.org/blog/november-19-dvc-heartbeatSat, 14 Dec 2019 00:00:00 GMT<p>The past few months have been so busy and full of great events! We love how
involved our community is and can’t wait to share more with you:</p>
<ul>
<li>
<p>We have organized our very first
<a href="https://www.meetup.com/San-Francisco-Machine-Learning-Meetup/events/264846847/" target="_blank" rel="nofollow noopener noreferrer">meetup</a>!
So many great conversations, new use cases and insights! Many thanks to
<a href="https://www.linkedin.com/in/daniel-fischetti-4a6592bb/" target="_blank" rel="nofollow noopener noreferrer">Dan Fischetti</a> from
<a href="https://standard.ai/" target="_blank" rel="nofollow noopener noreferrer">Standard Cognition</a>, who joined our Dmitry Petrov on
stage. Watch the recording here.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/RHQXK7EC0jI?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
</li>
<li>
<p><a href="https://blog.dataversioncontrol.com/dvc-org-for-hacktoberfest-2019-ce5320151a0c" target="_blank" rel="nofollow noopener noreferrer">Hacktoberfest</a>
was a great exercise for the DVC team on many levels and we really enjoyed
supporting new contributors. Kudos to
<a href="https://twitter.com/explorer_07" target="_blank" rel="nofollow noopener noreferrer">Nabanita Dash</a> for organizing a cool
DVC-themed hackathon!</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Our open source event Hacktoberfest-themed meet-up was a success. Thanks to <a href="https://twitter.com/DVCorg">@DVCorg</a> and it's mentors for all the hard work. <br>Some of our attendees made their first PR on DVC and got them merged. Kudos to the team! <br>PS: 🍕 was the second best thing of the evening. <a href="https://t.co/zAWC0TVlPd">pic.twitter.com/zAWC0TVlPd</a></p>— Programming Society IIIT-Bh (@psociiit) <a href="https://twitter.com/psociiit/status/1185150096792535040">October 18, 2019</a></blockquote>
</li>
<li>
<p>We’ve crossed the 4k-star mark on <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">GitHub</a>!</p>
</li>
<li>
<p>DVC participated in the
<a href="https://twitter.com/FossMec/status/1192866498324254720" target="_blank" rel="nofollow noopener noreferrer">Devsprints</a> (Thank
you <a href="https://twitter.com/kurianbenoy2" target="_blank" rel="nofollow noopener noreferrer">Kurian Benoy</a> for the intro!) and we
were happy to jump in and help with some mentoring.</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Thank you <a href="https://twitter.com/DVCorg">@DVCorg</a> for participating in the Devsprints, by <a href="https://twitter.com/FossMec">@FossMEC</a> and <a href="https://twitter.com/excelmec">@excelmec</a>. We had <a href="https://twitter.com/shcheklein">@shcheklein</a> who joined us all the way from SF and explained how open source is boosting the future. Srinidhi and <a href="https://twitter.com/kurianbenoy2">@kurianbenoy2</a> helped participants get started to contributing to the project.</p>— FOSS MEC (@FossMec) <a href="https://twitter.com/FossMec/status/1192866498324254720">November 8, 2019</a></blockquote>
</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1fe957ddccf9aa3e7bb643d8e8ea8bed/39600/devsprints.png" alt="devsprints" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Devsprints participants on our
<a href="http://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord</a> channel</em></p>
<ul>
<li>
<p>DVC became part of the default
<a href="https://formulae.brew.sh/formula/dvc" target="_blank" rel="nofollow noopener noreferrer">Homebrew formulae</a>! So now you can
install it as easily as <code>brew install dvc</code>!</p>
</li>
<li>
<p>We helped 2 aspiring speakers deliver their very first conference talks.
<a href="https://twitter.com/kurianbenoy2/status/1183427495342694401?s=20" target="_blank" rel="nofollow noopener noreferrer">Kurian Benoy</a>
spoke at <a href="https://in.pycon.org/2019/" target="_blank" rel="nofollow noopener noreferrer">PyconIndia</a> and
<a href="https://www.linkedin.com/in/aman-sharma606/" target="_blank" rel="nofollow noopener noreferrer">Aman Sharma</a> spoke at
<a href="https://scipy.in/2019#speakers" target="_blank" rel="nofollow noopener noreferrer">SciPyIndia</a>. <strong>Supporting speakers is
something we are passionate about, so if you ever want to give a talk on a
DVC-related topic, we are here to help. Just
<a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">let us know</a>!</strong></p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/Ipzf6oQqQpo?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
</li>
<li>
<p>Our own <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> went to Europe to
speak at the
<a href="https://osseu19.sched.com/speaker/dmitry35" target="_blank" rel="nofollow noopener noreferrer">Open Source Summit Europe</a> in
Lyon, <a href="https://www.highload.ru/moscow/2019/abstracts/6032" target="_blank" rel="nofollow noopener noreferrer">Highload++</a> in
Moscow and made a stop in Berlin to co-host a
<a href="https://www.meetup.com/codecentric-Berlin/events/265555810/" target="_blank" rel="nofollow noopener noreferrer">meetup</a> with our
favourite AI folks from <a href="https://www.codecentric.de/" target="_blank" rel="nofollow noopener noreferrer">Codecentric</a>!</p>
</li>
</ul>
<hr>
<p>Here are some of the great pieces of content around DVC and MLOps that we
discovered in October and November:</p>
<ul>
<li><strong><a href="https://www.deploymachinelearning.com/" target="_blank" rel="nofollow noopener noreferrer">Deploy Machine Learning Models with Django</a>
by Piotr Płoński.</strong></li>
</ul>
<blockquote>
<p>…building your ML system has a great advantage — it is tailored to your needs.
It has all features that are needed in your ML system and can be as complex as
you wish. This tutorial is for readers who are familiar with ML and would like
to learn how to build ML web services.</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://www.deploymachinelearning.com/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Deploy Machine Learning Models with Django</h4>
<div class="elp-description">Version 1.0 (04/11/2019) Piotr Płoński The demand for Machine Learning (ML) applications is growing. Many resources…</div>
<div class="elp-link">deploymachinelearning.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-12-14/deploy-machine-learning-models-7cb0da3c9268f3e33159cdd160d56e13.png" alt="Deploy Machine Learning Models with Django">
</div>
</a>
</section>
<p></p>
<ul>
<li><strong><a href="https://towardsdatascience.com/how-to-manage-your-machine-learning-workflow-with-dvc-weights-biases-and-docker-5529ea4e59e0" target="_blank" rel="nofollow noopener noreferrer">How to Manage Your Machine Learning Workflow with DVC, Weights & Biases, and Docker</a>
by <a href="https://le-james94.medium.com" target="_blank" rel="nofollow noopener noreferrer">James Le</a>.</strong></li>
</ul>
<blockquote>
<p>In this article, I want to show 3 powerful tools to simplify and scale up
machine learning development within an organization by making it easy to
track, reproduce, manage, and deploy models.</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://towardsdatascience.com/how-to-manage-your-machine-learning-workflow-with-dvc-weights-biases-and-docker-5529ea4e59e0" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">How to Manage Your Machine Learning Workflow with DVC, Weights & Biases,
and Docker</h4>
<div class="elp-description">Managing a machine learning workflow is hard!</div>
<div class="elp-link">towardsdatascience.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-12-14/how-to-manage-your-machine-learning-workflow-c8cf3d6c0b055c1bd0d9bfa4f8e6d4da.jpeg" alt="How to Manage Your Machine Learning Workflow with DVC, Weights & Biases,
and Docker">
</div>
</a>
</section>
<p></p>
<ul>
<li><strong><a href="https://towardsdatascience.com/creating-a-solid-data-science-development-environment-60df14ce3a34" target="_blank" rel="nofollow noopener noreferrer">Creating a solid Data Science development environment</a>
by
<a href="https://towardsdatascience.com/@gabrielsgoncalves" target="_blank" rel="nofollow noopener noreferrer">Gabriel dos Santos Goncalves</a></strong></li>
</ul>
<blockquote>
<p>We do believe that Data Science is a field that can become even more mature by
using best practices in project development and that Conda, Git, DVC, and
JupyterLab are key components of this new approach</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://towardsdatascience.com/creating-a-solid-data-science-development-environment-60df14ce3a34" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Creating a solid Data Science development environment</h4>
<div class="elp-description">How to organize and replicate your development environment using Conda, Git, DVC, and JupyterLab.</div>
<div class="elp-link">towardsdatascience.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-12-14/creating-solid-data-science-dev-env-48e9ffb886f0ec17a2cfc0deab709dc6.png" alt="Creating a solid Data Science development environment">
</div>
</a>
</section>
<p></p>
<ul>
<li><strong><a href="https://medium.com/y-data-stories/creating-reproducible-data-science-workflows-with-dvc-3bf058e9797b" target="_blank" rel="nofollow noopener noreferrer">Creating reproducible data science workflows with DVC</a>
by <a href="https://medium.com/@glib.ivashkevych" target="_blank" rel="nofollow noopener noreferrer">Gleb Ivashkevich</a>.</strong></li>
</ul>
<blockquote>
<p>DVC is a powerful tool and we covered only the fundamentals of it.</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/y-data-stories/creating-reproducible-data-science-workflows-with-dvc-3bf058e9797b" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Creating reproducible data science workflows with DVC</h4>
<div class="elp-description">“Getting started” tutorial into DVC to make a structure and order in your daily ML routine</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-12-14/creating-reproducible-data-science-workflows-60ff778cfaeb3cacd8fe82988fe696eb.jpeg" alt="Creating reproducible data science workflows with DVC">
</div>
</a>
</section>
<p></p>
<hr>
<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>There are lots of hidden gems in our Discord community discussions. Sometimes
they are scattered all over the channels and hard to track down.</p>
<p>We sift through the issues and discussions and share the most
interesting takeaways with you.</p>
<h3 id="q-when-you-do-a-dvc-import-you-get-the-state-of-the-data-in-the-original-repo-at-that-moment-in-time-from-that-repo-right-the-overall-state-of-that-repo-eg-git-commit-id-hash-is-not-preserved-upon-import-right" style="position:relative;">Q: When you do a <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> you get the state of the data in the original repo at that moment in time from that repo, right? <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/618744949277458462" target="_blank" rel="nofollow noopener noreferrer">The overall state of that repo (e.g. Git <code>commit id</code> (hash)) is not preserved upon import, right?</a><a href="#q-when-you-do-a-dvc-import-you-get-the-state-of-the-data-in-the-original-repo-at-that-moment-in-time-from-that-repo-right-the-overall-state-of-that-repo-eg-git-commit-id-hash-is-not-preserved-upon-import-right" aria-label="q when you do a dvc import you get the state of the data in the original repo at that moment in time from that repo right the overall state of that repo eg git commit id hash is not preserved upon import right permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>On the contrary, DVC relies on the Git <code>commit id</code> (hash) to determine the state of
the data as well as the code. The Git <code>commit id</code> (hash) is saved in the DVC-file upon
import. The data itself is copied/downloaded into the DVC repo cache but is not
pushed to the remote, so DVC does not create duplicates. To advance the import to a
newer revision when needed, use <a href="https://dvc.org/doc/command-reference/update"><code>dvc update</code></a>. The Git commit hash is saved to
provide reproducibility: even if the source repo <code>HEAD</code> has changed, your import
stays the same until you run <a href="https://dvc.org/doc/command-reference/update"><code>dvc update</code></a> or redo <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a>.</p>
<h3 id="q-im-trying-to-understand-if-dvc-is-an-appropriate-solution-for-storing-data-under-gdpr-requirements-that-means-that-permanent-deletion-of-files-with-sensitive-data-needs-to-be-fully-supported" style="position:relative;">Q: I’m trying to understand if DVC is an appropriate solution for storing data under GDPR requirements. <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/621057268145848340" target="_blank" rel="nofollow noopener noreferrer">That means that permanent deletion of files with sensitive data needs to be fully supported.</a><a href="#q-im-trying-to-understand-if-dvc-is-an-appropriate-solution-for-storing-data-under-gdpr-requirements-that-means-that-permanent-deletion-of-files-with-sensitive-data-needs-to-be-fully-supported" aria-label="q im trying to understand if dvc is an appropriate solution for storing data under gdpr requirements that means that permanent deletion of files with sensitive data needs to be fully supported permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes, in this sense DVC is not very different from using bare S3, SSH or any
other storage where you can go and just delete data. DVC adds a bit of
overhead to locate the specific file to delete, but otherwise it’s all the same:
you will be able to delete any file you want. Read more details in
<a href="https://discordapp.com/channels/485586884165107732/485596304961962003/621062105524862987" target="_blank" rel="nofollow noopener noreferrer">this discussion</a>.</p>
<h3 id="q-is-there-anyway-to-get-the-remote-url-for-specific-dvc-files-say-i-have-a-dvc-file-foopngdvc--is-there-a-command-that-will-show-the-remote-url-something-like-dvc-get-remote-url-foopngdvc-which-will-return-eg-the-azure-url-to-download" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/621591769766821888" target="_blank" rel="nofollow noopener noreferrer">Is there anyway to get the remote url for specific DVC-files?</a> Say, I have a DVC-file <code>foo.png.dvc</code> — is there a command that will show the remote url, something like <code>dvc get-remote-url foo.png.dvc</code> which will return e.g. the Azure url to download.<a href="#q-is-there-anyway-to-get-the-remote-url-for-specific-dvc-files-say-i-have-a-dvc-file-foopngdvc--is-there-a-command-that-will-show-the-remote-url-something-like-dvc-get-remote-url-foopngdvc-which-will-return-eg-the-azure-url-to-download" aria-label="q is there anyway to get the remote url for specific dvc files say i have a dvc file foopngdvc is there a command that will show the remote url something like dvc get remote url foopngdvc which will return eg the azure url to download permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There is no special command for that, but if you are using Python, you could use
our API specifically designed for that:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>api <span class="token keyword">import</span> get_url
url <span class="token operator">=</span> get_url<span class="token punctuation">(</span>path<span class="token punctuation">,</span>
repo<span class="token operator">=</span><span class="token string">"https://github.com/user/proj"</span><span class="token punctuation">,</span>
rev<span class="token operator">=</span><span class="token string">"mybranch"</span><span class="token punctuation">)</span></code></pre></div>
<p>so you could wrap this in a small script and use it from the CLI as a wrapper command.</p>
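<p>As a rough sketch of such a wrapper, the script below exposes <code>dvc.api.get_url</code> as a small command-line tool. The script, its file name, and its <code>--repo</code> and <code>--rev</code> options are illustrative assumptions, not something DVC ships. It also assumes DVC is installed in the same Python environment.</p>

```python
# Hypothetical helper script (not part of DVC): print the remote storage URL
# of a DVC-tracked file by calling dvc.api.get_url.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Print the remote storage URL of a DVC-tracked file."
    )
    # The path is optional so that running without arguments just shows the help.
    parser.add_argument("path", nargs="?", help="tracked file, relative to the repo root")
    parser.add_argument("--repo", default=None, help="Git URL or local path of the DVC repo")
    parser.add_argument("--rev", default=None, help="Git revision: branch, tag, or commit hash")
    return parser


def main(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.path is None:
        parser.print_help()
        return 1
    # Imported lazily so the help text works even where DVC is not installed.
    from dvc.api import get_url

    print(get_url(args.path, repo=args.repo, rev=args.rev))
    return 0


if __name__ == "__main__":
    main()
```

<p>Saved as, say, <code>get_remote_url.py</code>, it could be called as <code>python get_remote_url.py foo.png --rev mybranch</code>. As far as we can tell, the argument should be the path of the tracked file itself rather than the path of its DVC-file.</p>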
<h3 id="q-can-dvc-be-integrated-with-ms-active-directory-ad-authentication-for-controlling-access-the-gdpr-requirements-would-force-me-to-use-such-a-system-to-manage-access" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/619244714071425035" target="_blank" rel="nofollow noopener noreferrer">Can DVC be integrated with MS Active Directory (AD) authentication for controlling access?</a> The GDPR requirements would force me to use such a system to manage access.<a href="#q-can-dvc-be-integrated-with-ms-active-directory-ad-authentication-for-controlling-access-the-gdpr-requirements-would-force-me-to-use-such-a-system-to-manage-access" aria-label="q can dvc be integrated with ms active directory ad authentication for controlling access the gdpr requirements would force me to use such a system to manage access permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Short answer: no (as of the date of publishing this Heartbeat issue). The good news is
that it should be very easy to add, so we would welcome a contribution :) Azure has a
connection argument for AD, and quick googling turns up this
<a href="https://github.com/AzureAD/azure-activedirectory-library-for-python" target="_blank" rel="nofollow noopener noreferrer">library</a>,
which is probably what is needed.</p>
<h3 id="q-how-do-i-uninstall-dvc-from-mac-installed-as-a-package" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/625124341201502209" target="_blank" rel="nofollow noopener noreferrer">How do I uninstall DVC from Mac installed as a package?</a><a href="#q-how-do-i-uninstall-dvc-from-mac-installed-as-a-package" aria-label="q how do i uninstall dvc from mac installed as a package permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>When DVC is installed from a plain <code>.pkg</code> it is a bit tricky to uninstall, so we usually
recommend using something like brew cask instead if you really need the binary
package. Try to run these commands:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">sudo</span> <span class="token function">rm</span> <span class="token parameter variable">-rf</span> /usr/local/bin/dvc
</span><span class="token line"><span class="token input">$ </span><span class="token command">sudo</span> <span class="token function">rm</span> <span class="token parameter variable">-rf</span> /usr/local/lib/dvc
</span><span class="token line"><span class="token input">$ </span><span class="token command">sudo</span> pkgutil <span class="token parameter variable">--forget</span> com.iterative.dvc</span></code></pre></div>
<p>to uninstall the package.</p>
<h3 id="q-we-are-using-ssh-remote-to-store-data-but-the-problem-is-that-everyone-within-the-project-has-different-username-on-the-remote-machine-and-thus-we-cannot-set-it-in-the-config-file-that-is-committed-to-git-is-there-a-way-to-add-just-host-and-path-without-the-username" style="position:relative;">Q: We are using SSH remote to store data, but the problem is that everyone within the project has different username on the remote machine and thus we cannot set it in the config file (that is committed to Git). <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/619420070111608848" target="_blank" rel="nofollow noopener noreferrer">Is there a way to add just host and path, without the username?</a><a href="#q-we-are-using-ssh-remote-to-store-data-but-the-problem-is-that-everyone-within-the-project-has-different-username-on-the-remote-machine-and-thus-we-cannot-set-it-in-the-config-file-that-is-committed-to-git-is-there-a-way-to-add-just-host-and-path-without-the-username" aria-label="q we are using ssh remote to store data but the problem is that everyone within the project has different username on the remote machine and thus we cannot set it in the config file that is committed to git is there a way to add just host and path without the username permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes, you should use the <code>--local</code> or <code>--global</code> config options to set the user per
project or per machine without sharing (committing) it to Git:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> myremote --local user myuser</span></code></pre></div>
<p>or</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> myremote --global user myuser</span></code></pre></div>
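<p>One way to picture why this works is as layered settings, where the more specific level wins. The sketch below is purely illustrative: it assumes the documented precedence (local over project over global), and the dictionaries, host name, and user names are made up. It is not how DVC implements its config.</p>

```python
from collections import ChainMap


def effective_settings(local, project, global_):
    """Merge per-level settings for one remote; earlier levels win on conflicts."""
    return dict(ChainMap(local, project, global_))


# Hypothetical values for a remote named "myremote".
project = {"url": "ssh://example.com/path/to/storage"}  # committed to Git, no user
global_ = {"user": "alice"}          # written with --global, per user and machine
local = {"user": "alice-project"}    # written with --local, this project only

settings = effective_settings(local, project, global_)
print(settings["url"], settings["user"])
```

<p>The shared project config carries only the host and path, while each teammate keeps their own <code>user</code> value in a level that is never committed.</p>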
<h3 id="q-i-still-get-the-ssl-error-when-i-try-to-perform-a-dvc-push-with-or-without-use_ssl--false" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/628227197592797191" target="_blank" rel="nofollow noopener noreferrer">I still get the <code>SSL ERROR</code> when I try to perform a dvc push with or without <code>use_ssl = false</code></a>?<a href="#q-i-still-get-the-ssl-error-when-i-try-to-perform-a-dvc-push-with-or-without-use_ssl--false" aria-label="q i still get the ssl error when i try to perform a dvc push with or without use_ssl false permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>A simple environment variable like this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token assign-left variable">AWS_CA_BUNDLE</span><span class="token operator">=</span>/path/to/cert/cert.crt dvc push</span></code></pre></div>
<p>should do the trick for now. We plan to fix the <code>ca_bundle</code> option soon.</p>
<h3 id="q-i-have-just-finished-a-lengthy-dvc-repro-and-im-happy-with-the-result-however-i-realized-that-i-didnt-specify-a-dependency-which-i-needed-and-obviously-is-used-in-the-computation-can-i-somehow-fix-it" style="position:relative;">Q: I have just finished a lengthy <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> and I’m happy with the result. However, I realized that I didn’t specify a dependency which I needed (and obviously is used in the computation). <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/620572187841265675" target="_blank" rel="nofollow noopener noreferrer">Can I somehow fix it?</a><a href="#q-i-have-just-finished-a-lengthy-dvc-repro-and-im-happy-with-the-result-however-i-realized-that-i-didnt-specify-a-dependency-which-i-needed-and-obviously-is-used-in-the-computation-can-i-somehow-fix-it" aria-label="q i have just finished a lengthy dvc repro and im happy with the result however i realized that i didnt specify a dependency which i needed and obviously is used in the computation can i somehow fix it permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You can add the dependency to the stage file without rerunning/reproducing the stage.
Reproducing is not needed because this additional dependency hasn’t changed.</p>
<p>You would need to edit the DVC-file. In the deps section add:</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token punctuation">-</span> <span class="token key atrule">path</span><span class="token punctuation">:</span> not/included/file/path</code></pre></div>
<p>and run <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit file.dvc</code></a> to save the changes without running the pipeline again.
See an example
<a href="https://discordapp.com/channels/485586884165107732/563406153334128681/620641530075414570" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<h3 id="q-for-some-reason-we-need-to-always-specify-the-remote-name-when-doing-a-dvc-push-eg-dvc-push--r-upstream-as-opposed-to-dvc-push-mind-no-additional-arguments" style="position:relative;">Q: For some reason <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/629704961868955648" target="_blank" rel="nofollow noopener noreferrer">we need to always specify the remote name when doing a <code>dvc push</code></a> e.g., <a href="https://dvc.org/doc/command-reference/push#-r"><code>dvc push -r upstream</code></a> as opposed to <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> (mind no additional arguments).<a href="#q-for-some-reason-we-need-to-always-specify-the-remote-name-when-doing-a-dvc-push-eg-dvc-push--r-upstream-as-opposed-to-dvc-push-mind-no-additional-arguments" aria-label="q for some reason we need to always specify the remote name when doing a dvc push eg dvc push r upstream as opposed to dvc push mind no additional arguments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You can mark a “default” remote:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">-d</span> remote /path/to/my/main/remote</span></code></pre></div>
<p>then, <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> (and other commands like <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a>) will use the
default remote.</p>
<h3 id="q-if-i-want-stage-b-to-run-after-stage-a-but-the-stage-a-has-no-output-can-i-specify-as-dvc-file-as-bs-dependency" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/620715145374466048" target="_blank" rel="nofollow noopener noreferrer">If I want stage B to run after stage A, but the stage A has no output, can I specify A’s DVC-file as B’s dependency?</a><a href="#q-if-i-want-stage-b-to-run-after-stage-a-but-the-stage-a-has-no-output-can-i-specify-as-dvc-file-as-bs-dependency" aria-label="q if i want stage b to run after stage a but the stage a has no output can i specify as dvc file as bs dependency permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>No, at least at the time of publishing this. You could use a phony output
though, e.g. make stage A output some dummy file and make B depend on it.
Please consider creating or upvoting a relevant issue on our GitHub if you’d
like this to be implemented.</p>
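<p>A minimal sketch of the phony-output idea, assuming stage A is a Python script: after doing its real work, it writes a small marker file that serves as its only output. The function name and the marker file name are hypothetical.</p>

```python
from datetime import datetime, timezone
from pathlib import Path


def run_stage_a(marker_path):
    """Run stage A, then write a marker file that acts as its phony output."""
    # Stage A's real side effects would happen here (placeholder in this sketch).
    marker = Path(marker_path)
    marker.parent.mkdir(parents=True, exist_ok=True)
    # A fresh timestamp changes the file on every run, so a stage that
    # depends on it is seen as outdated whenever stage A has run again.
    marker.write_text(datetime.now(timezone.utc).isoformat() + "\n")
    return marker


if __name__ == "__main__":
    print(run_stage_a("stage_a.done"))
```

<p>The marker (here <code>stage_a.done</code>) would then be declared as an output of stage A and as a dependency of stage B, which gives DVC the link it needs to order the two stages.</p>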
<h3 id="q-im-just-getting-started-with-dvc-but-id-like-to-use-it-for-multiple-developers-to-access-the-data-and-share-models-and-code-i-do-own-the-server-but-im-not-sure-how-to-use-dvc-with-ssh-remote" style="position:relative;">Q: I’m just getting started with DVC, but I’d like to use it for multiple developers to access the data and share models and code. <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/598867829785362452" target="_blank" rel="nofollow noopener noreferrer">I do own the server, but I’m not sure how to use DVC with SSH remote?</a><a href="#q-im-just-getting-started-with-dvc-but-id-like-to-use-it-for-multiple-developers-to-access-the-data-and-share-models-and-code-i-do-own-the-server-but-im-not-sure-how-to-use-dvc-with-ssh-remote" aria-label="q im just getting started with dvc but id like to use it for multiple developers to access the data and share models and code i do own the server but im not sure how to use dvc with ssh remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Please refer to
<a href="https://discuss.dvc.org/t/how-do-i-use-dvc-with-ssh-remote/279/2" target="_blank" rel="nofollow noopener noreferrer">this answer</a>
on the DVC forum and check the documentation for the
<a href="https://dvc.org/doc/command-reference/remote/add" target="_blank" rel="nofollow noopener noreferrer"><code>dvc remote add</code></a> and
<a href="https://dvc.org/doc/command-reference/remote/modify" target="_blank" rel="nofollow noopener noreferrer"><code>dvc remote modify</code></a>
commands to see more options and details.</p>
<hr>
<p>If you have any questions, concerns or ideas, let us know in the comments below
or connect with the DVC team <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a>. Our
<a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">DMs on Twitter</a> are always open, too.</p>https://dvc.org/blog/october-19-dvc-heartbeathttps://dvc.org/blog/october-19-dvc-heartbeatTue, 05 Nov 2019 00:00:00 GMT<h2 id="news-and-links" style="position:relative;">News and links<a href="#news-and-links" aria-label="news and links permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Autumn is a great season for new beginnings and there is so much we love about
it this year. Here are some of the highlights:</p>
<ul>
<li>
<p>Co-hosting our
<a href="https://www.meetup.com/San-Francisco-Machine-Learning-Meetup/events/264846847/" target="_blank" rel="nofollow noopener noreferrer">first ever meetup</a>!
Our <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> partnered with
<a href="https://www.linkedin.com/in/daniel-fischetti-4a6592bb/" target="_blank" rel="nofollow noopener noreferrer">Dan Fischetti</a> from
<a href="https://twitter.com/standardAI" target="_blank" rel="nofollow noopener noreferrer">Standard Cognition</a> to discuss open-source
tools for version controlling machine learning models and experiments. The
recording is available here:</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/RHQXK7EC0jI?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
</li>
<li>
<p><a href="https://blog.dataversioncontrol.com/dvc-org-for-hacktoberfest-2019-ce5320151a0c" target="_blank" rel="nofollow noopener noreferrer">Getting ready for the Hacktoberfest</a>
and having the whole team get together to pick up and label nice issues and be
ready to support the contributors.</p>
</li>
<li>
<p>Discovering some really cool blogposts, talks and tutorials from our users all
over the world: check
<a href="https://blog.octo.com/mise-en-application-de-dvc-sur-un-projet-de-machine-learning/" target="_blank" rel="nofollow noopener noreferrer">this blogpost in French</a>
or
<a href="https://jupyter-tutorial.readthedocs.io/de/latest/productive/dvc/" target="_blank" rel="nofollow noopener noreferrer">this tutorial in German</a>!</p>
</li>
<li>
<p>Having a great time working with a
<a href="https://github.com/dashohoxha" target="_blank" rel="nofollow noopener noreferrer">tech writer</a> brought to us by the
<a href="https://developers.google.com/season-of-docs" target="_blank" rel="nofollow noopener noreferrer">Google Season of Docs</a> program.
Check out these
<a href="https://dvc.org/doc/tutorials/interactive" target="_blank" rel="nofollow noopener noreferrer">interactive tutorials</a> we’ve
created together.</p>
</li>
<li>
<p>Having a heated internal discussion about Discord vs Slack support/community
channels. If you are on the fence like us, have a look at
<a href="https://internals.rust-lang.org/t/exploring-new-communication-channels/7859" target="_blank" rel="nofollow noopener noreferrer">this discussion</a>
in the Rust community; it is very helpful.</p>
</li>
<li>
<p>Seeing <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> being really happy one
day:</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">.<a href="https://twitter.com/martinfowler">@martinfowler</a>'s books and his website were always the source of programming wisdom 💎 His Refactoring book is the first book I recommend to developers.<br><br>Now they write about ML lifecycle and automation. I’m especially excited because they use <a href="https://twitter.com/DVCorg">@DVCorg</a> that we’ve created. <a href="https://t.co/HwswZqjOsb">https://t.co/HwswZqjOsb</a></p>— Dmitry Petrov (@FullStackML) <a href="https://twitter.com/FullStackML/status/1169403554290814976">September 5, 2019</a></blockquote>
</li>
</ul>
<hr>
<p>We at <a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC.org</a> are so happy every time we discover an article
featuring DVC or addressing one of the burning ML issues we are trying to solve.
Here are some of the links that caught our eye this past month:</p>
<ul>
<li><strong>Continuous Delivery for Machine Learning by
<a href="https://twitter.com/dtsato" target="_blank" rel="nofollow noopener noreferrer">Danilo Sato</a>,
<a href="https://twitter.com/arifwider" target="_blank" rel="nofollow noopener noreferrer">Arif Wider</a>,
<a href="https://twitter.com/intellification" target="_blank" rel="nofollow noopener noreferrer">Christoph Windheuser</a> and curated by
<a href="https://martinfowler.com/" target="_blank" rel="nofollow noopener noreferrer">Martin Fowler</a>.</strong></li>
</ul>
<blockquote>
<p>As Machine Learning techniques continue to evolve and perform more complex
tasks, so is evolving our knowledge of how to manage and deliver such
applications to production. By bringing and extending the principles and
practices from Continuous Delivery, we can better manage the risks of
releasing changes to Machine Learning applications in a safe and reliable way.</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://martinfowler.com/articles/cd4ml.html" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Continuous Delivery for Machine Learning</h4>
<div class="elp-description">bio I am a consultant at ThoughtWorks Germany, where I am leading our data and machine learning activities. I enjoy…</div>
<div class="elp-link">martinfowler.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-11-05/continuous-delivery-for-machine-learning-fd9ed27a4534371abdb90fcb1e5d1fb3.png" alt="Continuous Delivery for Machine Learning">
</div>
</a>
</section>
<p></p>
<ul>
<li><strong><a href="https://medium.com/signaturit-tech-blog/the-path-to-identity-validation-2-3-4f698b2ffae9" target="_blank" rel="nofollow noopener noreferrer">The Path to Identity Validation</a>
by <a href="https://medium.com/@victor.segura" target="_blank" rel="nofollow noopener noreferrer">Víctor Segura</a>.</strong></li>
</ul>
<blockquote>
<p>So, the first question is clear: how to choose the optimal hardware for neural
networks? Secondly, assuming that we have the appropriate infrastructure, how
to build the machine learning ecosystem to train our models efficiently and
not die trying? At <strong>Signaturit</strong>, we have the solution ;)</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/signaturit-tech-blog/the-path-to-identity-validation-2-3-4f698b2ffae9" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">The Path to Identity Validation (2/3)</h4>
<div class="elp-description">How to start your own machine learning project?</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-11-05/the-path-to-identity-validation-51339e974d8ad0c70b28bffb4cacd674.jpeg" alt="The Path to Identity Validation (2/3)">
</div>
</a>
</section>
<p></p>
<ul>
<li><strong>Talk:
<a href="https://pretalx.com/pyconuk-2019/talk/GCLBFH/" target="_blank" rel="nofollow noopener noreferrer">Managing Big Data in Machine Learning projects</a>
by <a href="https://twitter.com/vvasworld" target="_blank" rel="nofollow noopener noreferrer">V Vishnu Anirudh</a> at
<a href="https://2019.pyconuk.org/" target="_blank" rel="nofollow noopener noreferrer">PyCon UK 2019</a>.</strong></li>
</ul>
<blockquote>
<p>My talk will focus on Version Control Systems (VCS) for big-data projects.
With the advent of Machine Learning (ML) , the development teams find it
increasingly difficult to manage and collaborate on projects that deal with
huge amounts of data and ML models apart from just source code.</p>
</blockquote>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/4XpHk85_x0E?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<ul>
<li><strong>Podcast: TWIML Talk #295
<a href="https://twimlai.com/twiml-talk-295-managing-deep-learning-experiments-with-lukas-biewald/" target="_blank" rel="nofollow noopener noreferrer">Managing Deep Learning Experiments</a>
with <a href="https://twitter.com/l2k" target="_blank" rel="nofollow noopener noreferrer">Lukas Biewald</a></strong></li>
</ul>
<blockquote>
<p>Seeing a need for reproducibility in deep learning experiments, Lukas founded
Weights & Biases. In this episode we discuss his experiment tracking tool, how
it works, the components that make it unique in the ML marketplace and the
open, collaborative culture that Lukas promotes. Listen to Lukas delve into
how he got his start in deep learning experiments, what his experiment
tracking used to look like, the current Weights & Biases business success
strategy, and what his team is working on today.</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://twimlai.com/twiml-talk-295-managing-deep-learning-experiments-with-lukas-biewald/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Managing Deep Learning Experiments with Lukas Biewald — Talk #295</h4>
<div class="elp-description">Today we are joined by Lukas Biewald, CEO and Co-Founder of Weights & Biases. Lukas, previously CEO and Founder of…</div>
<div class="elp-link">twimlai.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-11-05/managing-deep-learning-experiments-cadd691f2bc6783395192d5944ad571a.jpeg" alt="Managing Deep Learning Experiments with Lukas Biewald — Talk #295">
</div>
</a>
</section>
<p></p>
<hr>
<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>There are lots of hidden gems in our Discord community discussions. Sometimes
they are scattered all over the channels and hard to track down.</p>
<p>We sift through the issues and discussions and share the most
interesting takeaways with you.</p>
<h3 id="q-ive-just-run-a-dvc-run-step-and-realised-i-forgot-to-declare-an-output-file-is-there-a-way-to-add-an-output-file-without-rerunning-the-computationally-expensive-stepstage" style="position:relative;">Q: I’ve just run a <code>dvc run</code> step, and realised I forgot to declare an output file. <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/593743448020877323" target="_blank" rel="nofollow noopener noreferrer">Is there a way to add an output file without rerunning the (computationally expensive) step/stage?</a><a href="#q-ive-just-run-a-dvc-run-step-and-realised-i-forgot-to-declare-an-output-file-is-there-a-way-to-add-an-output-file-without-rerunning-the-computationally-expensive-stepstage" aria-label="q ive just run a dvc run step and realised i forgot to declare an output file is there a way to add an output file without rerunning the computationally expensive stepstage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If you’ve already run it, you can open the created DVC-file in an editor
and add an entry to the <code>outs</code> field. After that, run <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit my.dvc</code></a> and
it will save the checksums and data without re-running your command.
<code>dvc run --no-exec</code> followed by <code>dvc commit</code> would also work, instead of modifying the
DVC-file by hand.</p>
<h3 id="q-for-metric-files-do-i-have-to-use-dvc-run-to-set-a-metric-or-can-i-do-it-some-other-way-can-i-use-metrics-functionality-without-the-need-to-setup-and-manage-dvc-cache-and-remote-storage" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/593869598651318282" target="_blank" rel="nofollow noopener noreferrer">For metric files do I have to use dvc run to set a metric or can I do it some other way?</a> Can I use metrics functionality without the need to setup and manage DVC cache and remote storage?<a href="#q-for-metric-files-do-i-have-to-use-dvc-run-to-set-a-metric-or-can-i-do-it-some-other-way-can-i-use-metrics-functionality-without-the-need-to-setup-and-manage-dvc-cache-and-remote-storage" aria-label="q for metric files do i have to use dvc run to set a metric or can i do it some other way can i use metrics functionality without the need to setup and manage dvc cache and remote storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Any file that is under DVC control (e.g. added with <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> or an output in
<code>dvc run -o</code>) can be made a metric file with <code>dvc metrics add file</code>. Alternatively,
the command <code>dvc run -M file</code> makes <code>file</code> a metric without caching it. This means
<code>dvc metrics show</code> can be used while the file is still versioned by Git.</p>
<h3 id="q-is-there-a-way-not-to-add-the-full-azure-connection-string-to-the-dvcconfig-file-that-is-being-checked-into-git-for-using-dvc-remotes-i-think-its-quite-unhealthy-to-have-secrets-checked-in-scm" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/595586670498283520" target="_blank" rel="nofollow noopener noreferrer">Is there a way not to add the full (Azure) connection string to the .dvc/config file that is being checked into Git for using dvc remotes</a>? I think it’s quite unhealthy to have secrets checked in SCM.<a href="#q-is-there-a-way-not-to-add-the-full-azure-connection-string-to-the-dvcconfig-file-that-is-being-checked-into-git-for-using-dvc-remotes-i-think-its-quite-unhealthy-to-have-secrets-checked-in-scm" aria-label="q is there a way not to add the full azure connection string to the dvcconfig file that is being checked into git for using dvc remotes i think its quite unhealthy to have secrets checked in scm permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There are two options: use the <code>AZURE_STORAGE_CONNECTION_STRING</code> environment
variable, or use the <code>--local</code> flag, which puts the setting into <code>.dvc/config.local</code>.
That file is listed in <code>.gitignore</code>, so Git doesn’t track it and your
secrets are not exposed.</p>
<h3 id="q-i-would-like-to-know-if-it-is-possible-to-manage-files-under-dvc-whilst-keeping-them-in-their-original-locations-eg-on-a-network-drive-in-a-given-folder-structure-if-i-want-to-add-a-large-file-to-be-tracked-by-dvc-and-it-is-in-a-bucket-on-s3-or-gcs-can-i-do-that-without-downloading-it-locally" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/601068667131920385" target="_blank" rel="nofollow noopener noreferrer">I would like to know if it is possible to manage files under DVC whilst keeping them in their original locations (e.g. on a network drive in a given folder structure)</a>? <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/615278138896941101" target="_blank" rel="nofollow noopener noreferrer">If I want to add a large file to be tracked by DVC, and it is in a bucket on S3 or GCS, can I do that without downloading it locally?</a><a href="#q-i-would-like-to-know-if-it-is-possible-to-manage-files-under-dvc-whilst-keeping-them-in-their-original-locations-eg-on-a-network-drive-in-a-given-folder-structure-if-i-want-to-add-a-large-file-to-be-tracked-by-dvc-and-it-is-in-a-bucket-on-s3-or-gcs-can-i-do-that-without-downloading-it-locally" aria-label="q i would like to know if it is possible to manage files under dvc whilst keeping them in their original locations eg on a network drive in a given folder structure if i want to add a large file to be tracked by dvc and it is in a bucket on s3 or gcs can i do that without downloading it locally permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 
6z"></path></svg>
</a></h3>
<p>Yes, you are probably looking for external dependencies and outputs. The
<a href="https://dvc.org/doc/user-guide/managing-external-data" target="_blank" rel="nofollow noopener noreferrer">documentation</a> is a good
place to start.</p>
<h3 id="q-how-do-i-setup-dvc-so-that-nas-eg-synology-acts-as-a-shared-dvc-cache" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/606388040377565215" target="_blank" rel="nofollow noopener noreferrer">How do I setup DVC so that NAS (e.g. Synology) acts as a shared DVC cache?</a><a href="#q-how-do-i-setup-dvc-so-that-nas-eg-synology-acts-as-a-shared-dvc-cache" aria-label="q how do i setup dvc so that nas eg synology acts as a shared dvc cache permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Using NAS (e.g. NFS) is a very common scenario for DVC. In short, you use
<a href="https://dvc.org/doc/command-reference/cache/dir"><code>dvc cache dir</code></a> to set up an external cache, set the cache type to use symlinks, and
enable protected mode. We are preparing a
<a href="https://github.com/iterative/dvc.org/blob/31c5d424c6530bb793af69c2af578d2b8a374d02/static/docs/use-cases/shared-storage-on-nfs.md" target="_blank" rel="nofollow noopener noreferrer">document</a>
on how to set up NFS as a shared cache, but I think it can be applied to any
NAS.</p>
<h3 id="q-so-i-have-some-data-that-is-in-the-hundreds-of-gigs-if-i-enable-symlink-hardlink-strategy-and-cache-protecting-will-dvc-automatically-choose-this-strategy-over-copying-when-trying-to-use-dvc-add" style="position:relative;">Q: So I have some data that is in the hundreds of gigs. <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/608013531010301952" target="_blank" rel="nofollow noopener noreferrer">If I enable symlink, hardlink strategy and cache protecting, will DVC automatically choose this strategy over copying when trying to use dvc add</a>?<a href="#q-so-i-have-some-data-that-is-in-the-hundreds-of-gigs-if-i-enable-symlink-hardlink-strategy-and-cache-protecting-will-dvc-automatically-choose-this-strategy-over-copying-when-trying-to-use-dvc-add" aria-label="q so i have some data that is in the hundreds of gigs if i enable symlink hardlink strategy and cache protecting will dvc automatically choose this strategy over copying when trying to use dvc add permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes, it will! To clarify: with those settings in place,
<a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> will move the data to your cache and then create a
hardlink from your cache to your workspace.</p>
<p>Unless your cache directory and your workspace are on different file systems,
the move should be instant. You can find more information
<a href="https://dvc.org/doc/user-guide/large-dataset-optimization" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
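<p>To illustrate why the move is instant on a single file system, here is a
minimal Python sketch that uses only operating system primitives. It is not
DVC’s implementation: the helper names <code>same_filesystem</code> and
<code>move_to_cache_and_link</code> are hypothetical, and the file system is assumed to
support hard links.</p>

```python
import os


def same_filesystem(path_a, path_b):
    """Return True if two existing paths are on the same device (file system)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def move_to_cache_and_link(workspace_file, cache_file):
    """Move a file into the cache, then hard-link it back into the workspace.

    Both steps only touch metadata when the two locations share a file
    system, so the file contents are never copied.
    """
    cache_dir = os.path.dirname(cache_file)
    if cache_dir:
        os.makedirs(cache_dir, exist_ok=True)
    os.rename(workspace_file, cache_file)  # a rename, not a copy
    os.link(cache_file, workspace_file)    # second name for the same inode
```

<p>If <code>same_filesystem</code> returns <code>False</code>, a rename or a hard link across the two
locations is not possible, which is why the data has to be copied in that case.</p>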
<h3 id="q-my-repos-dvc-is-busy-and-locked-and-im-not-sure-how-it-got-that-way-and-how-to-removediagnose-the-lock-any-suggestions" style="position:relative;">Q: My repo’s DVC is “busy and locked” and I’m not sure how it got that way and how to remove/diagnose the lock. <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/608392956679815168" target="_blank" rel="nofollow noopener noreferrer">Any suggestions?</a><a href="#q-my-repos-dvc-is-busy-and-locked-and-im-not-sure-how-it-got-that-way-and-how-to-removediagnose-the-lock-any-suggestions" aria-label="q my repos dvc is busy and locked and im not sure how it got that way and how to removediagnose the lock any suggestions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC uses a lock file to prevent running two commands at the same time. The lock
<a href="https://dvc.org/doc/user-guide/dvc-internals" target="_blank" rel="nofollow noopener noreferrer">file</a> is under the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a>
directory. If no DVC commands are running and you are still getting this error, it’s
safe to remove this file manually to resolve the issue.</p>
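<p>The manual cleanup can be sketched in a few lines of Python. The file name
<code>lock</code> directly under <code>.dvc</code> is an assumption here (check the internals page
linked above for your DVC version), and <code>remove_stale_lock</code> is a hypothetical
helper, not part of DVC.</p>

```python
from pathlib import Path


def remove_stale_lock(repo_root, lock_name="lock"):
    """Delete <repo_root>/.dvc/<lock_name> if present; return True if removed.

    Only call this when you are sure no DVC command is still running.
    The default lock_name is an assumption, not a documented constant.
    """
    lock_path = Path(repo_root) / ".dvc" / lock_name
    if lock_path.is_file():
        lock_path.unlink()
        return True
    return False
```

<p>Calling <code>remove_stale_lock(".")</code> from the project root returns <code>False</code> when there
is nothing to clean up, so it is safe to run more than once.</p>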
<h3 id="q-im-trying-to-understand-how-does-dvc-remote-add-work-in-case-of-a-local-folder-and-what-is-the-best-workflow-when-data-is-outside-of-your-project-root" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/611209851757920266" target="_blank" rel="nofollow noopener noreferrer">I’m trying to understand how does DVC remote add work in case of a local folder and what is the best workflow when data is outside of your project root?</a><a href="#q-im-trying-to-understand-how-does-dvc-remote-add-work-in-case-of-a-local-folder-and-what-is-the-best-workflow-when-data-is-outside-of-your-project-root" aria-label="q im trying to understand how does dvc remote add work in case of a local folder and what is the best workflow when data is outside of your project root permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>When using DVC, in most cases we assume that your data will be somewhere under
the project root. There is an option to use so-called
<a href="https://dvc.org/doc/user-guide/managing-external-data" target="_blank" rel="nofollow noopener noreferrer">external dependencies</a>,
which is data that is usually too big to be stored under your project root. But
if you operate on data of a reasonable size, I would recommend
starting by putting the data somewhere under the project root. Remotes are usually
the places where you store your data, and it is DVC’s task to move your data around.
If you want to keep your current setup, with the data in a different
place than your project, you will need to refer to the data with full paths. So, for
example:</p>
<ol>
<li>
<p>You are in <code>/home/gabriel/myproject</code> and you have initialized DVC and Git
repositories.</p>
</li>
<li>
<p>You have <code>featurize.py</code> in your project dir, and want to use the data to produce
some features and then <code>train.py</code> to train a model.</p>
</li>
<li>
<p>Run the command:</p>
</li>
</ol>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-d</span> /research_data/myproject/videos <span class="token punctuation">\</span>
<span class="token parameter variable">-o</span> /research_data/myproject/features <span class="token punctuation">\</span>
python featurize.py</span></code></pre></div>
<p>This tells DVC that you use <code>/research_data/myproject/videos</code> to featurize, and
write the output to your features dir. Note that your code should be aware of
those paths (they can be hardcoded inside <code>featurize.py</code>); the point of <code>dvc run</code>
is just to tell DVC which artifacts belong to the currently defined step of the ML
pipeline.</p>
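<p>As an illustration of a script that is “aware of those paths”, here is a
hypothetical, minimal <code>featurize.py</code>. The real script is not shown in the
discussion, so everything below is an assumption: the placeholder “feature” is
just the size of each input file, and only the two directory paths come from
the command above.</p>

```python
import os

# Hardcoded paths matching the `dvc run` example; DVC only records them,
# the script itself still has to know where to read and write.
VIDEOS_DIR = "/research_data/myproject/videos"
FEATURES_DIR = "/research_data/myproject/features"


def featurize(videos_dir, features_dir):
    """Write one placeholder feature file per input file; return their names."""
    os.makedirs(features_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(videos_dir)):
        src = os.path.join(videos_dir, name)
        if not os.path.isfile(src):
            continue
        out_name = name + ".features.txt"
        # Placeholder "feature": the input size in bytes.
        with open(os.path.join(features_dir, out_name), "w") as f:
            f.write(str(os.path.getsize(src)) + "\n")
        written.append(out_name)
    return written


# Run only when the input directory exists, so that importing or testing
# this sketch on another machine does nothing.
if __name__ == "__main__" and os.path.isdir(VIDEOS_DIR):
    featurize(VIDEOS_DIR, FEATURES_DIR)
```

<p>Because the paths live in the script, the <code>-d</code> and <code>-o</code> options only describe
what the command reads and writes; they do not pass anything to it.</p>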
<h3 id="q-when-i-run-du-command-to-check-how-much-space-dvc-project-consumes-i-see-that-it-duplicatescopies-data-its-very-space-and-time-consuming-to-copy-large-data-files-is-there-a-way-to-avoid-that-it-takes-too-long-to-add-large-files-to-dvc" style="position:relative;">Q: When I run <code>du</code> command to check how much space DVC project consumes I see that it duplicates/copies data. <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/613935477896249364" target="_blank" rel="nofollow noopener noreferrer">It’s very space and time consuming to copy large data files, is there a way to avoid that?</a> It takes too long to add large files to DVC.<a href="#q-when-i-run-du-command-to-check-how-much-space-dvc-project-consumes-i-see-that-it-duplicatescopies-data-its-very-space-and-time-consuming-to-copy-large-data-files-is-there-a-way-to-avoid-that-it-takes-too-long-to-add-large-files-to-dvc" aria-label="q when i run du command to check how much space dvc project consumes i see that it duplicatescopies data its very space and time consuming to copy large data files is there a way to avoid that it takes too long to add large files to dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes! You don’t have to copy files with DVC. There are two reasons
why <code>du</code> can show that data under DVC control takes double the space.
First, <code>du</code> can be inaccurate when the underlying file system supports reflinks (XFS on
Linux, APFS on Mac, etc.). This is actually the best scenario, since no copying is
happening and no changes are required to any DVC settings. The second case means
that copy semantics are used, which is the default. Copying can be turned off by setting the cache
type to <code>symlink</code> or <code>hardlink</code>. Please read more on this
<a href="https://dvc.org/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
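<p>The difference between a copy and a hard link can be checked with a short
Python sketch. It covers only the hard link case (reflinks cannot be detected
this way), and <code>disk_usage</code> is a hypothetical helper that counts each inode
once, in the spirit of what <code>du</code> does for hard-linked files.</p>

```python
import os


def disk_usage(root):
    """Sum file sizes under root, counting each hard-linked inode only once."""
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # another name for data that was already counted
            seen.add(key)
            total += st.st_size
    return total
```

<p>With a copy of a file in the same tree the total doubles; with a hard link
created by <code>os.link</code> it stays the same, because both names point to one inode.</p>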
<h3 id="q-how-can-i-detach-a-file-from-dvc-control" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/615479227189559323" target="_blank" rel="nofollow noopener noreferrer">How can I detach a file from DVC control?</a><a href="#q-how-can-i-detach-a-file-from-dvc-control" aria-label="q how can i detach a file from dvc control permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Just removing the corresponding DVC-file and running <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a> after that should
be enough. It’ll stop tracking the data file and clean the local cache that
might still contain it. Note: don’t forget to run <a href="https://dvc.org/doc/command-reference/unprotect"><code>dvc unprotect</code></a> if you use
an advanced <a href="https://dvc.org/doc/user-guide/large-dataset-optimization" target="_blank" rel="nofollow noopener noreferrer">DVC setup with symlinks and hardlinks</a>
(the <code>cache.type</code> config option is not the default). If the <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a> behavior is not
granular enough, you can manually find the file by its checksum from the DVC-file in
<code>.dvc/cache</code> and remote storage. Learn
<a href="https://dvc.org/doc/user-guide/dvc-internals#structure-of-cache-directory" target="_blank" rel="nofollow noopener noreferrer">here</a>
how they are organized.</p>
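<p>As a rough sketch of that lookup, the helper below maps a checksum taken from a
DVC-file to a path in the local cache. It assumes the layout in which the first
two characters of the checksum form a subdirectory and the rest is the file
name; this may differ between DVC versions, so treat the linked page as the
authority. <code>cache_path</code> is a hypothetical name, not a DVC function.</p>

```python
from pathlib import Path


def cache_path(checksum, cache_dir=".dvc/cache"):
    """Map a checksum to <cache_dir>/<first two chars>/<remaining chars>.

    The two-character split is an assumed layout, not a guaranteed one.
    """
    return Path(cache_dir) / checksum[:2] / checksum[2:]


# Example with the MD5 of an empty file.
example = cache_path("d41d8cd98f00b204e9800998ecf8427e")
```

<p>Here <code>example</code> points to <code>.dvc/cache/d4/1d8cd98f00b204e9800998ecf8427e</code>; the same
relative path can be used to look for the object in remote storage.</p>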
<h3 id="q-im-trying-to-understand-if-dvc-is-an-appropriate-solution-for-storing-data-under-gdpr-requirements-that-means-that-permanent-deletion-of-files-with-sensitive-data-needs-to-be-fully-supported" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/621057268145848340" target="_blank" rel="nofollow noopener noreferrer">I’m trying to understand if DVC is an appropriate solution for storing data under GDPR requirements.</a> That means that permanent deletion of files with sensitive data needs to be fully supported.<a href="#q-im-trying-to-understand-if-dvc-is-an-appropriate-solution-for-storing-data-under-gdpr-requirements-that-means-that-permanent-deletion-of-files-with-sensitive-data-needs-to-be-fully-supported" aria-label="q im trying to understand if dvc is an appropriate solution for storing data under gdpr requirements that means that permanent deletion of files with sensitive data needs to be fully supported permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes, in this sense DVC is not very different from using bare S3, SSH or any
other storage where you can go and just delete data. DVC can add a bit of
overhead when locating a specific file to delete, but otherwise it’s all the same:
you will be able to delete any file you want. See more details on how you
can retrospectively edit directories under DVC control
<a href="https://discordapp.com/channels/485586884165107732/485596304961962003/621062105524862987" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<hr>
<p>If you have any questions, concerns or ideas, let us know in the comments below
or connect with the DVC team <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a>. Our
<a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">DMs on Twitter</a> are always open, too.</p>https://dvc.org/blog/dvc-org-for-hacktoberfest-2019https://dvc.org/blog/dvc-org-for-hacktoberfest-2019Tue, 08 Oct 2019 00:00:00 GMT<p><a href="https://hacktoberfest.digitalocean.com/" target="_blank" rel="nofollow noopener noreferrer">Hacktoberfest</a> is a month-long
program that celebrates open source and encourages you to contribute to open
source projects (and rewards you with stickers and a cool T-shirt!). Whether
you’re a seasoned contributor or looking for projects to contribute to for the
first time, you’re welcome to participate!</p>
<p>It is the 6th season of Hacktoberfest and the 2nd year of participation for the
DVC.org team. We really enjoyed it in 2018 and this year we are upping the game
with our own cool stickers, special edition T-shirts and a
<a href="https://github.com/iterative/dvc/labels/hacktoberfest" target="_blank" rel="nofollow noopener noreferrer">collection of carefully picked tickets</a>.</p>
<h3 id="how-to-participate" style="position:relative;">How to participate?<a href="#how-to-participate" aria-label="how to participate permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If you haven’t started your Hacktoberfest challenge yet, now is the right
time: you have 3 weeks left to submit PRs and get your swag! Here are some
important details:</p>
<ul>
<li>
<p>Hacktoberfest is open to everyone in the global community.</p>
</li>
<li>
<p>You can sign up anytime between October 1 and October 31. Make sure to sign up
on the
<a href="https://hacktoberfest.digitalocean.com/" target="_blank" rel="nofollow noopener noreferrer">official Hacktoberfest website</a> for
your PRs to count.</p>
</li>
<li>
<p>To get a shirt, you must make 4 legit pull requests (PRs) between October 1–31
in any time zone.</p>
</li>
<li>
<p>Pull requests can be made in any public GitHub-hosted repositories/projects,
not just the ones highlighted.</p>
</li>
</ul>
<p>And a special addition from the DVC.org team:</p>
<ul>
<li>
<p>Look through the list of
<a href="https://github.com/iterative/dvc/labels/hacktoberfest" target="_blank" rel="nofollow noopener noreferrer">DVC Hacktoberfest tickets</a>
or the list of
<a href="https://github.com/iterative/dvc/labels/good%20first%20issue" target="_blank" rel="nofollow noopener noreferrer">good DVC first issues</a>.</p>
</li>
<li>
<p>Make a PR to DVC and get our stickers.</p>
</li>
<li>
<p>Close three issues for DVC and get a special DVC T-shirt.</p>
</li>
</ul>
<h3 id="why-contribute-to-dvc" style="position:relative;">Why contribute to DVC?<a href="#why-contribute-to-dvc" aria-label="why contribute to dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="http://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a> (Data Version Control) is a relatively young open source
project. It was started in late 2017 by a data scientist and an engineer to fill
the gaps in ML process tooling. Nowadays DVC is growing pretty fast, and
though our in-house team is quite small, we have our contributors (more
than 100 in both code and docs) to thank for developing DVC with us.</p>
<p>DVC is participating in Hacktoberfest for the 2nd year in a row to bring more people
into open source, to learn from them and to give back by sharing our own
experience. This year we decided to focus on a single topic that is important for us:
improving UI/UX.</p>
<p>As our contributors and maintainers were sifting through the feature requests,
bugs, and improvements to create a good
<a href="https://github.com/iterative/dvc/labels/hacktoberfest" target="_blank" rel="nofollow noopener noreferrer">list of Hacktoberfest tickets</a>,
we noticed that the UI/UX label on GitHub kept popping up again and again. DVC is a
command line tool, and improving UI/UX in our case means making decisions on how
to name command options, where and when to use
<a href="https://github.com/iterative/dvc/issues/2498" target="_blank" rel="nofollow noopener noreferrer">confirmation prompts</a> and/or
where to abort execution, what exactly the user would expect to see in the output, how
to test it later, etc.</p>
<p>Why does improving UI/UX appear to be so important for DVC at this stage? Perhaps
because the project is more mature now and we are ready to spend more time on
polishing it. Or maybe because it is still too engineering-focused and we used
to disregard/de-prioritize all this ‘fancy’ stuff. Or it is because we just lack
experience in creating good CLI UI/UX!</p>
<p>One way or another, those are great reasons to focus on improving UI (in a broader
sense than just GUI), improving docs, creating a powerful, consistent experience
for our users and increasing the accessibility of DVC.</p>
<p>Here is how
<a href="https://devcenter.heroku.com/articles/cli-style-guide" target="_blank" rel="nofollow noopener noreferrer">Heroku’s CLI style guide</a>
starts:</p>
<blockquote>
<p>Heroku CLI plugins should provide a clear user experience, targeted primarily
for human readability and usability, which delights the user, while at the
same time supporting advanced users and output formats. This article provides
a clear direction for designing delightful CLI plugins.</p>
</blockquote>
<p>At DVC we are building user experience in line with these principles too, but we
also have our own challenges. And here we turn for help to the global open
source community and all the contributors out there.</p>
<p>For all of us who have a heart for open source — let’s discuss, contribute,
learn, take the technologies forward and build something great together!</p>
<p>Happy hacking!</p>
<hr>
<p>We are happy to hear from you <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a>. Our
<a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">DMs on Twitter</a> are always open, too!</p>https://dvc.org/blog/september-19-dvc-heartbeathttps://dvc.org/blog/september-19-dvc-heartbeatThu, 26 Sep 2019 00:00:00 GMT<h2 id="news-and-links" style="position:relative;">News and links<a href="#news-and-links" aria-label="news and links permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We are super excited to co-host our very first
<strong><a href="https://www.meetup.com/San-Francisco-Machine-Learning-Meetup/events/264846847/" target="_blank" rel="nofollow noopener noreferrer">meetup in San Francisco on October 10</a></strong>!
We will gather at the brand new Dropbox HQ office at 6:30 pm to discuss
open-source tools to version control ML models and experiments.
<a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> is teaming up with
<a href="https://www.linkedin.com/in/daniel-fischetti-4a6592bb/" target="_blank" rel="nofollow noopener noreferrer">Daniel Fischetti</a> from
<a href="https://standard.ai/" target="_blank" rel="nofollow noopener noreferrer">Standard Cognition</a> to discuss best ML practices. Join us
and save your spot now:</p>
<p>
</p><section class="elp-content-holder">
<a href="https://www.meetup.com/San-Francisco-Machine-Learning-Meetup/events/264846847/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Open-source tools to version control Machine Learning models and experiments</h4>
<div class="elp-description">AI and ML are becoming an essential part of the engineering and data science everyday workflow. ML teams need new tools…</div>
<div class="elp-link">meetup.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-09-26/open-source-tools-to-version-control-9fbecd80e325857bc75eecce526311b8.png" alt="Open-source tools to version control Machine Learning models and experiments">
</div>
</a>
</section>
<p></p>
<p>If you are not in SF on this date and happen to be in Europe, don’t miss the
PyCon DE &amp; PyData Berlin 2019 joint event on October 9–11. We cannot make it to
Berlin this year, but we were thrilled to discover two independent talks featuring
DVC by
<a href="https://pyvideo.org/pydata-berlin-2019/version-control-for-data-science.html" target="_blank" rel="nofollow noopener noreferrer">Alessia Marcolini</a>
and
<a href="https://pyvideo.org/pydata-berlin-2019/tools-that-help-you-get-your-experiments-under-control.html" target="_blank" rel="nofollow noopener noreferrer">Katharina Rasch</a>.</p>
<p>Some other highlights of the end of summer:</p>
<ul>
<li>
<p>Our users and contributors keep creating fantastic pieces of content around
DVC. We share some links below, but they are only a fraction of what we have in
stock, and we couldn’t be happier or more humbled about it!</p>
</li>
<li>
<p>We’ve reached 79 contributors to the
<a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">DVC core project</a> and 74 contributors to the
<a href="https://github.com/iterative/dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC documentation</a> (and have something
special in mind to celebrate our 100th contributors).</p>
</li>
<li>
<p>We enjoyed working with all the talented
<a href="https://developers.google.com/season-of-docs/" target="_blank" rel="nofollow noopener noreferrer">Google Season of Docs</a>
applicants and are now moving to the next stage with our chosen tech writer,
<a href="http://dashohoxha.fs.al/" target="_blank" rel="nofollow noopener noreferrer">Dashamir Hoxha</a>.</p>
</li>
<li>
<p>We’ve crossed the 3,000-star mark on GitHub
(<a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">over 3,500 now</a>). Thank you for your
support!</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr"><a href="https://t.co/vhkN3zWzjT">https://t.co/vhkN3zWzjT</a> just hit 3000 stars on <a href="https://twitter.com/hashtag/Github?src=hash&ref_src=twsrc%5Etfw">#Github</a>! <a href="https://t.co/AILppwghuu">https://t.co/AILppwghuu</a> <br>Thank you for your trust, your contributions and your insights🤝<br>We are beyond happy to have you with us on this exciting journey🚀 <a href="https://t.co/dwokD2v7t7">pic.twitter.com/dwokD2v7t7</a></p>— 🦉DVC (@DVCorg) <a href="https://twitter.com/DVCorg/status/1147220439472545793">July 5, 2019</a></blockquote>
</li>
<li>
<p>We had a great time at the
<a href="https://events.linuxfoundation.org/events/open-source-summit-north-america-2019/program/" target="_blank" rel="nofollow noopener noreferrer">Open Source Summit</a>
by the Linux Foundation in San Diego: speaking on stage, running a booth and
chatting with the amazing open-source crowd out there.</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Love all <a href="https://twitter.com/DVCorg">@DVCorg</a> booth buzz at <a href="https://twitter.com/hashtag/OSSummit?src=hash&ref_src=twsrc%5Etfw">#OSSummit</a>! 🎉<br>Stop by and grab some cool swag 🌈and participate in our easy fun contest to win a Jetson Nano, the coolest fuzzy owls and a bunch of other staff! 🤩 <a href="https://t.co/MIzfilhrRJ">pic.twitter.com/MIzfilhrRJ</a></p>— Sveta Grinchenko 🇺🇦 (@a142hr) <a href="https://twitter.com/a142hr/status/1164256520235675648">August 21, 2019</a></blockquote>
</li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ccbbea0b26a9ac64744739bf7a5ee8b5/03346/open-source-summit-by-linux-foundation.jpg" alt="open source summit by linux foundation" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<hr>
<p>Here are some of the great pieces of content around DVC and ML ops that we
discovered in July and August:</p>
<ul>
<li>
<p><strong>Great insightful discussion on Twitter about versioning ML projects started
by <a href="https://medium.com/@NathanBenaich" target="_blank" rel="nofollow noopener noreferrer">Nathan Benaich</a>.</strong></p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">🙏Question to ML friends: How do you go about version control for your ML projects (data, models, and intermediate steps in your data pipelines)? Have you built your own tools? Are using something open source? Or a SaaS? Or does this come bundled with your ML infra products? Thx!</p>— Nathan Benaich (@nathanbenaich) <a href="https://twitter.com/nathanbenaich/status/1151815916512010242">July 18, 2019</a></blockquote>
</li>
<li>
<p><strong><a href="https://medium.com/ixorthink/our-machine-learning-workflow-dvc-mlflow-and-training-in-docker-containers-5b9c80cdf804" target="_blank" rel="nofollow noopener noreferrer">Our Machine Learning Workflow: DVC, MLFlow and Training in Docker Containers</a>
by <a href="https://medium.com/@ward.vanlaer" target="_blank" rel="nofollow noopener noreferrer">Ward Van Laer</a>.</strong></p>
</li>
</ul>
<blockquote>
<p>It is possible to manage your work flow using open-source and free tools.</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/ixorthink/our-machine-learning-workflow-dvc-mlflow-and-training-in-docker-containers-5b9c80cdf804" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Our Machine Learning Workflow: DVC, MLFlow and Training in Docker Containers</h4>
<div class="elp-description">Googling for machine learning frameworks to version data, track python models etc.. I was surprised to see that these…</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-09-26/our-machine-learning-workflow-356399465e0f6c05c8d759fbc3be264a.jpeg" alt="Our Machine Learning Workflow: DVC, MLFlow and Training in Docker Containers">
</div>
</a>
</section>
<p></p>
<ul>
<li><strong><a href="https://medium.com/qonto-engineering/using-dvc-to-create-an-efficient-version-control-system-for-data-projects-96efd94355fe" target="_blank" rel="nofollow noopener noreferrer">Using DVC to create an efficient version control system for data projects</a>
by <a href="https://medium.com/@basile_16101" target="_blank" rel="nofollow noopener noreferrer">Basile Guerrapin</a>.</strong></li>
</ul>
<blockquote>
<p>DVC brought versioning for inputs, intermediate files and algorithm models to
the VAT auto-detection project and this drastically increased our
<strong>productivity</strong>.</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/qonto-engineering/using-dvc-to-create-an-efficient-version-control-system-for-data-projects-96efd94355fe" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Using DVC to create an efficient version control system for data projects</h4>
<div class="elp-description">At first we were looking for a tool to help us dealing with production data files such as trained machine learning…</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-09-26/using-dvc-to-create-an-efficient-vcs-377e67b47c660bf412f0284e01c46d16.png" alt="Using DVC to create an efficient version control system for data projects">
</div>
</a>
</section>
<p></p>
<ul>
<li><strong><a href="https://techsparx.com/software-development/ai/dvc/versioning-example.html" target="_blank" rel="nofollow noopener noreferrer">Managing versioned machine learning datasets in DVC, and easily share ML projects with colleagues</a>
by <a href="https://twitter.com/7genblogger" target="_blank" rel="nofollow noopener noreferrer">David Herron</a>.</strong></li>
</ul>
<blockquote>
<p>In this tutorial we will go over a simple image classifier. We will learn how
DVC works in a machine learning project, how it optimizes reproducing results
when the project is changed, and how to share the project with colleagues.</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://techsparx.com/software-development/ai/dvc/versioning-example.html" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Managing versioned machine learning datasets in DVC, and easily share ML projects with colleagues</h4>
<div class="elp-description">Software Development Artificial Intelligence Data Version Control (DVC) Managing versioned machine learning datasets in…</div>
<div class="elp-link">techsparx.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-09-26/managing-versioned-machine-learning-datasets-c82b4558da0d1197f2ecafe10ae88e5b.jpeg" alt="Managing versioned machine learning datasets in DVC, and easily share ML projects with colleagues">
</div>
</a>
</section>
<p></p>
<ul>
<li><strong><a href="https://towardsdatascience.com/how-to-use-data-version-control-dvc-in-a-machine-learning-project-a78245c0185" target="_blank" rel="nofollow noopener noreferrer">How to use data version control (dvc) in a machine learning project</a>
by <a href="https://towardsdatascience.com/@matthiasbitzer94" target="_blank" rel="nofollow noopener noreferrer">Matthias Bitzer</a>.</strong></li>
</ul>
<blockquote>
<p>To illustrate the use of dvc in a machine learning context, we assume that our
data is divided into train, test and validation folders by default, with the
amount of data increasing over time either through an active learning cycle or
by manually adding new data.</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://towardsdatascience.com/how-to-use-data-version-control-dvc-in-a-machine-learning-project-a78245c0185" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">How to use data version control (dvc) in a machine learning project</h4>
<div class="elp-description">When working in a productive machine learning project you probably deal with a tone of data and several models. To keep…</div>
<div class="elp-link">towardsdatascience.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-09-26/how-to-use-data-version-control-0e75fc4c1aaa64b0466ba4235a189f56.jpeg" alt="How to use data version control (dvc) in a machine learning project">
</div>
</a>
</section>
<p></p>
<ul>
<li><strong><a href="https://towardsdatascience.com/version-control-ml-model-4adb2db5f87c" target="_blank" rel="nofollow noopener noreferrer">Version Control ML Model</a>
by <a href="https://towardsdatascience.com/@TianchenW" target="_blank" rel="nofollow noopener noreferrer">Tianchen Wu</a></strong></li>
</ul>
<blockquote>
<p>This post presents a solution to version control machine learning models with
git and dvc (<a href="https://dvc.org/doc/tutorial" target="_blank" rel="nofollow noopener noreferrer">Data Version Control</a>).</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://towardsdatascience.com/version-control-ml-model-4adb2db5f87c" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Version Control ML Model</h4>
<div class="elp-description">Machine Learning operations (let’s call it MLOps under the current buzzword pattern xxOps) are quite different from…</div>
<div class="elp-link">towardsdatascience.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-09-26/version-control-ml-model-d95a668bbc2b17aaf3cb8795e510d604.png" alt="Version Control ML Model">
</div>
</a>
</section>
<p></p>
<ul>
<li><strong><a href="https://dev.to/robogeek/reflinks-vs-symlinks-vs-hard-links-and-how-they-can-help-machine-learning-projects-1cj4" target="_blank" rel="nofollow noopener noreferrer">Reflinks vs symlinks vs hard links, and how they can help machine learning projects</a>
by <a href="https://medium.com/@7genblogger" target="_blank" rel="nofollow noopener noreferrer">David Herron</a></strong></li>
</ul>
<blockquote>
<p>In this blog post we’ll go over the details of using links, some cool new
stuff in modern file systems (reflinks), and an example of how DVC (Data
Version Control, <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">https://dvc.org/</a>) leverages this.</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://dev.to/robogeek/reflinks-vs-symlinks-vs-hard-links-and-how-they-can-help-machine-learning-projects-1cj4" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Reflinks vs symlinks vs hard links, and how they can help machine learning projects</h4>
<div class="elp-description">Hard links and symbolic links have been available since time immemorial, and we use them all the time without even…</div>
<div class="elp-link">dev.to</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-09-26/reflinks-vs-symlinks-vs-hard-links-b95aa2004eda8198752604ff86f8321c.jpeg" alt="Reflinks vs symlinks vs hard links, and how they can help machine learning projects">
</div>
</a>
</section>
<p></p>
<ul>
<li><strong><a href="https://blog.codecentric.de/en/2019/08/dvc-dependency-management/" target="_blank" rel="nofollow noopener noreferrer">DVC dependency management — a guide</a>
by <a href="https://blog.codecentric.de/en/author/bert-besser/" target="_blank" rel="nofollow noopener noreferrer">Bert Besser</a> and
<a href="https://blog.codecentric.de/en/author/veronika-schindler/" target="_blank" rel="nofollow noopener noreferrer">Veronika Schwan</a>.</strong></li>
</ul>
<blockquote>
<p>This post is a follow-up to
<a href="https://blog.codecentric.de/en/2019/03/walkthrough-dvc/" target="_blank" rel="nofollow noopener noreferrer">A walkthrough of DVC</a>
that deals with managing dependencies between DVC projects. In particular,
this follow-up is about importing specific versions of an artifact (e.g. a
trained model or a dataset) from one DVC project into another.</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://blog.codecentric.de/en/2019/08/dvc-dependency-management/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">DVC dependency management - a guide - codecentric AG Blog</h4>
<div class="elp-description">This post is a follow-up to A walkthrough of DVC that deals with managing dependencies between DVC projects. In…</div>
<div class="elp-link">blog.codecentric.de</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-09-26/dvc-org-a5af4abb87a983796d837b5df9b4f382.png" alt="DVC dependency management - a guide - codecentric AG Blog">
</div>
</a>
</section>
<p></p>
<ul>
<li><strong><a href="https://medium.com/@czeslaw.szubert/effective-ml-teams-lessons-learned-6a6e761bc283" target="_blank" rel="nofollow noopener noreferrer">Effective ML Teams — Lessons Learned</a>
by <a href="https://medium.com/@czeslaw.szubert" target="_blank" rel="nofollow noopener noreferrer">Czeslaw Szubert</a></strong></li>
</ul>
<blockquote>
<p>In this post I’ll present lessons learned on how to setup successful ML teams
and what you need to devise an effective enterprise ML strategy.</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://medium.com/@czeslaw.szubert/effective-ml-teams-lessons-learned-6a6e761bc283" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Effective ML Teams — Lessons Learned</h4>
<div class="elp-description">Machine Learning and Artificial Intelligence has entered our everyday lives — from Virtual Assistants built into each…</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-09-26/effective-ml-teams-7a63cfae30c573559b85ebf714981f26.jpeg" alt="Effective ML Teams — Lessons Learned">
</div>
</a>
</section>
<p></p>
<ul>
<li><strong><a href="https://www.esentri.com/lessons-learned-from-training-a-german-speech-recognition-model/" target="_blank" rel="nofollow noopener noreferrer">Lessons learned from training a German Speech Recognition model</a>
by <a href="https://www.linkedin.com/in/dschoenleber/" target="_blank" rel="nofollow noopener noreferrer">David Schönleber</a>.</strong></li>
</ul>
<blockquote>
<p>Setting up a documentation-by-design workflow and using appropriate tools
where needed, e.g. <em>MLFlow</em> and <em>dvc,</em> can be a real deal-breaker.</p>
</blockquote>
<p>
</p><section class="elp-content-holder">
<a href="https://www.esentri.com/lessons-learned-from-training-a-german-speech-recognition-model/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Lessons Learned from Training a German Speech Recognition Model - esentri AG</h4>
<div class="elp-description">This post is the first of a two-part series. In this first part, I address learnings from a recent project in which I…</div>
<div class="elp-link">esentri.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-09-26/lessons-learned-from-training-47361fe6a8b0e428d267d1cdf745c431.jpeg" alt="Lessons Learned from Training a German Speech Recognition Model - esentri AG">
</div>
</a>
</section>
<p></p>
<hr>
<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>There are lots of hidden gems in our Discord community discussions. Sometimes
they are scattered all over the channels and hard to track down.</p>
<p>We are sifting through the issues and discussions to share the most
interesting takeaways with you.</p>
<h3 id="q-im-getting-an-error-message-while-trying-to-use-aws-s3-storage-error-failed-to-push-data-to-the-cloud--unable-to-locate-credentials-any-ideas-whats-happening" style="position:relative;">Q: I’m getting an error message while trying to use AWS S3 storage: <code>ERROR: failed to push data to the cloud — Unable to locate credentials.</code> <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/587792932061577218" target="_blank" rel="nofollow noopener noreferrer">Any ideas what’s happening?</a><a href="#q-im-getting-an-error-message-while-trying-to-use-aws-s3-storage-error-failed-to-push-data-to-the-cloud--unable-to-locate-credentials-any-ideas-whats-happening" aria-label="q im getting an error message while trying to use aws s3 storage error failed to push data to the cloud unable to locate credentials any ideas whats happening permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Most likely you haven’t configured your S3 credentials/AWS account yet. Please
read the full documentation on the AWS website. In short, you need to do the
following:</p>
<ul>
<li>
<p><a href="https://portal.aws.amazon.com/gp/aws/developer/registration/index.html" target="_blank" rel="nofollow noopener noreferrer">Create your AWS account.</a></p>
</li>
<li>
<p>Log in to your AWS Management Console.</p>
</li>
<li>
<p>Click on your user name at the top right of the page.</p>
</li>
<li>
<p>Click on the Security Credentials link from the drop-down menu.</p>
</li>
<li>
<p>Find the Access Credentials section, and copy the latest <code>Access Key ID</code>.</p>
</li>
<li>
<p>Click on the Show link in the same row, and copy the <code>Secret Access Key</code>.</p>
</li>
</ul>
<p>Follow
<a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html" target="_blank" rel="nofollow noopener noreferrer">this link</a>
to set up your environment.</p>
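<p>As a quick sanity check before running <code>dvc push</code>, the sketch below looks for static AWS credentials in the two most common places: the <code>AWS_ACCESS_KEY_ID</code>/<code>AWS_SECRET_ACCESS_KEY</code> environment variables and the <code>[default]</code> profile of <code>~/.aws/credentials</code>. The helper name <code>find_aws_credentials</code> is hypothetical, it is not part of DVC or the AWS tooling, and it deliberately ignores other credential sources such as IAM roles.</p>

```python
import configparser
import os
from pathlib import Path


def find_aws_credentials(env=None, credentials_path=None, profile="default"):
    """Return where static AWS credentials were found, or None.

    Hypothetical helper for illustration only: it checks the standard
    environment variables first, then the shared credentials file.
    """
    env = os.environ if env is None else env
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"

    path = Path(credentials_path or Path.home() / ".aws" / "credentials")
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently ignored
    if parser.has_section(profile):
        section = parser[profile]
        if section.get("aws_access_key_id") and section.get("aws_secret_access_key"):
            return f"file:{path}"
    return None


if __name__ == "__main__":
    source = find_aws_credentials()
    print(source or "No static credentials found; run `aws configure` first.")
```

<p>If the script reports that nothing was found, the "Unable to locate credentials" error is most likely caused by the missing configuration described above.</p>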
<h3 id="q-i-added-data-with-dvc-add-or-dvc-run-and-see-that-it-takes-twice-what-it-was-before-with-du-command-does-it-mean-that-dvc-copies-data-that-is-added-under-its-control-how-do-i-prevent-this-from-happening" style="position:relative;">Q: I added data with <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> or <code>dvc run</code> and see that it takes twice what it was before (with <code>du</code> command). <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/595402051203235861" target="_blank" rel="nofollow noopener noreferrer">Does it mean that DVC copies data that is added under its control? How do I prevent this from happening?</a><a href="#q-i-added-data-with-dvc-add-or-dvc-run-and-see-that-it-takes-twice-what-it-was-before-with-du-command-does-it-mean-that-dvc-copies-data-that-is-added-under-its-control-how-do-i-prevent-this-from-happening" aria-label="q i added data with dvc add or dvc run and see that it takes twice what it was before with du command does it mean that dvc copies data that is added under its control how do i prevent this from happening permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In short: by default, DVC copies the files from your working directory to the
cache, because duplicating the data is the safest option. If you have reflinks
(copy-on-write) enabled on your file system, DVC will use that method instead,
which is as safe as copying. You can also configure DVC to use
hardlinks/symlinks to save some space and time, but that requires enabling the
protected mode (making data files in the workspace read-only). Read
more details <a href="https://dvc.org/doc/user-guide/large-dataset-optimization" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
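<p>To see why hard links save space but call for read-only files, here is a minimal sketch using only the Python standard library. It is not DVC's implementation; the function and file names are made up for illustration. It creates a file, a copy and a hard link, then edits the file through the hard link.</p>

```python
import os
import shutil
import tempfile


def compare_copy_and_hardlink(workdir):
    """Create a file plus a copy and a hard link of it, then compare them."""
    original = os.path.join(workdir, "data.bin")
    with open(original, "wb") as f:
        f.write(b"\0" * 1024)

    copied = os.path.join(workdir, "copy.bin")
    linked = os.path.join(workdir, "hardlink.bin")
    shutil.copy(original, copied)  # an independent second copy of the bytes
    os.link(original, linked)      # a second name for the same bytes

    inode = os.stat(original).st_ino
    report = {
        "copy_shares_inode": os.stat(copied).st_ino == inode,
        "hardlink_shares_inode": os.stat(linked).st_ino == inode,
        "link_count": os.stat(original).st_nlink,
    }

    # Writing through the hard link also changes the original, which is
    # why linked workspace files are safer when they are read-only.
    with open(linked, "r+b") as f:
        f.write(b"X")
    with open(original, "rb") as f:
        report["edit_via_link_changed_original"] = f.read(1) == b"X"
    return report


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for key, value in compare_copy_and_hardlink(tmp).items():
            print(f"{key}: {value}")
```

<p>The copy uses extra disk space but is isolated from the original. The hard link uses no extra space, but an edit through it silently alters the other name too, which in DVC's case would be the cached file.</p>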
<h3 id="q-how-concurrent-friendly-is-the-cache-and-different-remotes-is-it-safe-to-have-several-containersnodes-fill-the-same-cache-at-the-same-time" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/599345778703597568" target="_blank" rel="nofollow noopener noreferrer">How concurrent-friendly is the cache? And different remotes? Is it safe to have several containers/nodes fill the same cache at the same time?</a><a href="#q-how-concurrent-friendly-is-the-cache-and-different-remotes-is-it-safe-to-have-several-containersnodes-fill-the-same-cache-at-the-same-time" aria-label="q how concurrent friendly is the cache and different remotes is it safe to have several containersnodes fill the same cache at the same time permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>It is safe, and having a shared cache is a very common use case for DVC. See
<a href="https://discuss.dvc.org/t/share-nas-data-in-server/180/12" target="_blank" rel="nofollow noopener noreferrer">this thread</a>,
for example.</p>
<h3 id="qwhat-is-the-proper-way-to-exit-the-ascii-visualization-when-you-run-dvc-pipeline-show-command" style="position:relative;">Q:<a href="https://discordapp.com/channels/485586884165107732/563406153334128681/603890677176336394" target="_blank" rel="nofollow noopener noreferrer">What is the proper way to exit the ASCII visualization?</a> (when you run <code>dvc pipeline show</code> command).<a href="#qwhat-is-the-proper-way-to-exit-the-ascii-visualization-when-you-run-dvc-pipeline-show-command" aria-label="qwhat is the proper way to exit the ascii visualization when you run dvc pipeline show command permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>See this
<a href="https://dvc.org/doc/commands-reference/pipeline/show#options" target="_blank" rel="nofollow noopener noreferrer">document</a>. To
navigate, use arrows or W, A, S, D keys. To exit, press Q.</p>
<h3 id="q-is-there-an-issue-if-i-set-my-caches3-external-cache-to-my-default-remote-i-dont-quite-understand-what-an-external-cache-is-for-other-than-i-have-to-have-it-for-external-outputs" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/606197026488844338" target="_blank" rel="nofollow noopener noreferrer">Is there an issue if I set my <code>cache.s3</code> external cache to my default remote?</a> I don’t quite understand what an external cache is for other than I have to have it for external outputs.<a href="#q-is-there-an-issue-if-i-set-my-caches3-external-cache-to-my-default-remote-i-dont-quite-understand-what-an-external-cache-is-for-other-than-i-have-to-have-it-for-external-outputs" aria-label="q is there an issue if i set my caches3 external cache to my default remote i dont quite understand what an external cache is for other than i have to have it for external outputs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The short answer is that we suggest keeping them separate to avoid possible
checksum overlaps. Checksums on S3 might theoretically overlap with our checksums
(while the file contents differ), so it could be dangerous. The
chances of losing data are pretty slim, but we would not risk it. Right now, we
are working on making sure that no overlap is possible.</p>
<h3 id="q-whats-the-right-procedure-to-move-a-step-dvc-file-around-the-project" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/606425815139221504" target="_blank" rel="nofollow noopener noreferrer">What’s the right procedure to move a step .dvc file around the project?</a><a href="#q-whats-the-right-procedure-to-move-a-step-dvc-file-around-the-project" aria-label="q whats the right procedure to move a step dvc file around the project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Assuming the file was created with <code>dvc run</code>, there are a few possible ways.
The obvious one is to delete the file and create a new one with
<code>dvc run --no-exec -f file/path/and/name.dvc</code>. Another possibility is to
rename/move the file and then edit it manually. See
<a href="https://dvc.org/doc/user-guide/project-structure" target="_blank" rel="nofollow noopener noreferrer">this document</a>, which describes
how DVC-files are organized. No matter which method you use, you can run
<a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit file.dvc</code></a> to save changes without running the command again.</p>
<h3 id="q-dvc-status-doesnt-seem-to-report-things-that-need-to-be-dvc-pushed-is-that-by-design" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/606917839688957952" target="_blank" rel="nofollow noopener noreferrer"><code>dvc status</code> doesn’t seem to report things that need to be dvc pushed, is that by design?</a><a href="#q-dvc-status-doesnt-seem-to-report-things-that-need-to-be-dvc-pushed-is-that-by-design" aria-label="q dvc status doesnt seem to report things that need to be dvc pushed is that by design permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Try <code>dvc status --cloud</code> or <a href="https://dvc.org/doc/command-reference/status#--remote"><code>dvc status --remote &lt;your-remote&gt;</code></a>
to compare your local cache with a remote one. By default, the command only compares the
“working directory” with your local cache (to check whether something should be
reproduced and saved or not).</p>
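<p>As a rough sketch of the difference, the three checks could look like this. The remote name <code>myremote</code> is a placeholder and does not come from the original question:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"># Compare the workspace with the local cache (the default behavior)
$ dvc status
# Compare the local cache with the default remote
$ dvc status --cloud
# Compare the local cache with a specific remote (placeholder name)
$ dvc status --remote myremote</code></pre></div>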
<h3 id="q-what-kind-of-files-can-you-put-into-dvc-metrics" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/608701494035873792" target="_blank" rel="nofollow noopener noreferrer">What kind of files can you put into <code>dvc metrics</code>?</a><a href="#q-what-kind-of-files-can-you-put-into-dvc-metrics" aria-label="q what kind of files can you put into dvc metrics permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The file can be in any format; <a href="https://dvc.org/doc/command-reference/metrics"><code>dvc metrics</code></a> <code>show</code> will try to interpret the
format and output it in the best possible way. Also, if you are using <code>csv</code> or
<code>json</code>, you can use the <code>--xpath</code> flag to query specific measurements. <strong>In
general, you can make any file a metric file and put any content into it; DVC is
not opinionated about it.</strong> Usually, though, these are files that measure the
performance/accuracy of your model and capture the configuration of experiments.
The idea is to use <a href="https://dvc.org/doc/command-reference/metrics/show"><code>dvc metrics show</code></a> to display all your metrics across
experiments so you can decide which combination (of features,
parameters, algorithms, architecture, etc.) works best.</p>
<h3 id="q-does-dvc-take-into-account-the-timestamp-of-a-file-or-is-the-md5-only-depends-on-the-files-actualbits-content" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/613639458000207902" target="_blank" rel="nofollow noopener noreferrer">Does DVC take into account the timestamp of a file or is the MD5 only depends on the files actual/bits content?</a><a href="#q-does-dvc-take-into-account-the-timestamp-of-a-file-or-is-the-md5-only-depends-on-the-files-actualbits-content" aria-label="q does dvc take into account the timestamp of a file or is the md5 only depends on the files actualbits content permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC takes into account only content (bits) of a file to calculate hashes that
are saved into DVC-files.</p>
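<p>To illustrate the idea outside of DVC, here is a small shell sketch. It assumes GNU <code>md5sum</code> is available and uses a made-up file name; it is not DVC's own hashing code. Changing only the modification time should leave the content hash untouched:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"># sample.txt is a hypothetical file used only for this demo
printf 'same bytes\n' > sample.txt
before=$(md5sum sample.txt | cut -d' ' -f1)

# Change only the modification time, not the content
touch -t 200001010000 sample.txt
after=$(md5sum sample.txt | cut -d' ' -f1)

# The two hashes are expected to match
if [ "$before" = "$after" ]; then echo "hash unchanged"; fi</code></pre></div>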
<h3 id="q-similar-to-dvc-gc-is-there-a-command-to-garbage-collect-from-the-remote" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/616421757808541721" target="_blank" rel="nofollow noopener noreferrer">Similar to <code>dvc gc</code> is there a command to garbage collect from the remote?</a><a href="#q-similar-to-dvc-gc-is-there-a-command-to-garbage-collect-from-the-remote" aria-label="q similar to dvc gc is there a command to garbage collect from the remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://dvc.org/doc/command-reference/gc#--remote"><code>dvc gc --remote NAME</code></a> does this, but you should be extra careful, because
it will remove everything that is not currently “in use” (by the working
directory). Also, please check this
<a href="https://github.com/iterative/dvc/issues/2325" target="_blank" rel="nofollow noopener noreferrer">issue</a>: the semantics of this
command might have changed by the time you read this.</p>
<h3 id="q-how-do-i-use-and-configure-remote-storage-on-ibm-cloud-object-storage" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/591237578209099786" target="_blank" rel="nofollow noopener noreferrer">How do I use and configure remote storage on IBM Cloud Object Storage?</a><a href="#q-how-do-i-use-and-configure-remote-storage-on-ibm-cloud-object-storage" aria-label="q how do i use and configure remote storage on ibm cloud object storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Since it’s S3 compatible, specifying <code>endpointurl</code> (exact URL depends on the
<a href="https://cloud.ibm.com/docs/services/cloud-object-storage?topic=cloud-object-storage-endpoints" target="_blank" rel="nofollow noopener noreferrer">region</a>)
is the way to go:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">-d</span> mybucket s3://path/to/dir
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> mybucket <span class="token punctuation">\</span>
endpointurl <span class="token punctuation">\</span>
https://s3.eu.cloud-object-storage.appdomain.cloud</span></code></pre></div>
<h3 id="q-how-can-i-push-data-from-client-to-google-cloud-bucket-using-dvc-just-want-to-know-how-can-i-set-the-credentials" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/592958360903483403" target="_blank" rel="nofollow noopener noreferrer">How can I push data from client to google cloud bucket using DVC?</a>. Just want to know how can i set the credentials.<a href="#q-how-can-i-push-data-from-client-to-google-cloud-bucket-using-dvc-just-want-to-know-how-can-i-set-the-credentials" aria-label="q how can i push data from client to google cloud bucket using dvc just want to know how can i set the credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You can do it by setting an environment variable that points to the path of your
credentials file, like this:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">export</span> <span class="token assign-left variable">GOOGLE_APPLICATION_CREDENTIALS</span><span class="token operator">=</span>path/to/credentials</span></code></pre></div>
<p>It is also possible to set this variable via <a href="https://dvc.org/doc/command-reference/config"><code>dvc config</code></a>:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> myremote credentialpath /path/to/my/creds</span></code></pre></div>
<p>where <code>myremote</code> is your remote name.</p>
<hr>
<p>If you have any questions, concerns or ideas, let us know in the comments below
or connect with the DVC team <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a>. Our
<a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">DMs on Twitter</a> are always open, too.</p>https://dvc.org/blog/july-19-dvc-heartbeathttps://dvc.org/blog/july-19-dvc-heartbeatThu, 01 Aug 2019 00:00:00 GMT<h2 id="news-and-links" style="position:relative;">News and links<a href="#news-and-links" aria-label="news and links permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>As we continue to grow DVC together with our fantastic contributors, we enjoy
more and more insights, discussions, and articles either created or brought to
us by our community. We feel it is the right time to start sharing more of your
news, your stories and your discoveries. New Heartbeat is here!</p>
<p>Speaking of our own news: next month the DVC team is going to the
<a href="https://events.linuxfoundation.org/events/open-source-summit-north-america-2019/" target="_blank" rel="nofollow noopener noreferrer">Open Source Summit North America</a>.
It is taking place in San Diego on August 21–23.
<a href="https://ossna19.sched.com/speaker/dmitry35" target="_blank" rel="nofollow noopener noreferrer">Dmitry</a> and
<a href="https://ossna19.sched.com/speaker/svetlanagrinchenko" target="_blank" rel="nofollow noopener noreferrer">Sveta</a> will be giving
talks and we will run a booth. So looking forward to it! Stop by for a chat and
some cool swag. And if you are in San Diego on those days and want to catch up —
please let us know <a href="http://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a> or on Twitter!</p>
<p>
</p><section class="elp-content-holder">
<a href="https://ossna19.sched.com/event/PUVv/open-source-tools-for-ml-experiments-management-dmitry-petrov-ruslan-kuprieiev-iterative-ai" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Open Source Summit + ELC North America 2019: Open Source Tools for ML Experiments Man...</h4>
<div class="elp-description">Speakers Software Engineer, Iterative AI Ruslan is a Software Engineer at Iterative AI. Previously he worked on live…</div>
<div class="elp-link">ossna19.sched.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-08-01/open-source-north-america-summit-fc282755298bb8aa0dd4feb0d7fad084.png" alt="Open Source Summit + ELC North America 2019: Open Source Tools for ML Experiments Man...">
</div>
</a>
</section>
<p></p>
<p>
</p><section class="elp-content-holder">
<a href="https://ossna19.sched.com/event/PWNk/speaker-preparation-simple-steps-with-a-tremendous-impact-svetlana-grinchenko-dvcorg" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Open Source Summit + ELC North America 2019: Speaker Preparation: Simple Steps with a...</h4>
<div class="elp-description">Speakers Head of Developer Relations, DVC.org Svetlana is driving developer relations and community at DVC.org…</div>
<div class="elp-link">ossna19.sched.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-08-01/open-source-north-america-summit-fc282755298bb8aa0dd4feb0d7fad084.png" alt="Open Source Summit + ELC North America 2019: Speaker Preparation: Simple Steps with a...">
</div>
</a>
</section>
<p></p>
<p>Every month our team is excited to discover new great pieces of content
addressing some of the burning ML issues. Here are some of the links that caught
our eye in June:</p>
<ul>
<li><strong><a href="https://dev.to/robogeek/principled-machine-learning-4eho" target="_blank" rel="nofollow noopener noreferrer">Principled Machine Learning: Practices and Tools for Efficient Collaboration</a>
by <a href="https://medium.com/@7genblogger" target="_blank" rel="nofollow noopener noreferrer">David Herron</a></strong></li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://dev.to/robogeek/principled-machine-learning-4eho" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Principled Machine Learning: Practices and Tools for Efficient Collaboration</h4>
<div class="elp-description">Machine learning projects are often harder than they should be. The code to train an ML model is just software, and we…</div>
<div class="elp-link">dev.to</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-08-01/principled-machine-learning-20fa6eb01b0a36da8cdc0d97d2f96cc2.jpeg" alt="Principled Machine Learning: Practices and Tools for Efficient Collaboration">
</div>
</a>
</section>
<p></p>
<blockquote>
<p>As we’ve seen in this article some tools and practices can be borrowed from
regular software engineering. However, the needs of machine learning projects
dictate tools that better fit the purpose.</p>
</blockquote>
<ul>
<li><strong>First
<a href="http://ml-repa.ru/" target="_blank" rel="nofollow noopener noreferrer">ML-REPA</a> <a href="http://ml-repa.ru/page6697700.html" target="_blank" rel="nofollow noopener noreferrer">Meetup: Reproducible ML experiments</a>,
hosted by <a href="https://dgtl.raiffeisen.ru/" target="_blank" rel="nofollow noopener noreferrer">Raiffeisen DGTL</a>. Check out the video
and slide decks.</strong></li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="http://ml-repa.ru/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Machine Learning REPA</h4>
<div class="elp-description">Анонсы мероприятий, проектов, обзоров инструментов и кейсов про ML проекты, управление экспериментами, автоматизацию и…</div>
<div class="elp-link">ml-repa.ru</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-08-01/machine-learning-repa-56a858376395041b5fa650963e76fac1.png" alt="Machine Learning REPA">
</div>
</a>
</section>
<p></p>
<p><a href="http://ml-repa.ru/" target="_blank" rel="nofollow noopener noreferrer">ML-REPA</a> is a fantastic new resource for
Russian-speaking folks interested in Reproducibility, Experiments and Pipelines
Automation. It is curated by <a href="https://twitter.com/mnrozhkov" target="_blank" rel="nofollow noopener noreferrer">Mikhail Rozhkov</a> and
highly recommended by our team.</p>
<h3 id="how-do-you-manage-your-machine-learning-experiments-discussion-on-reddit-is-full-of-insights" style="position:relative;"><a href="https://www.reddit.com/r/MachineLearning/comments/bx0apm/d_how_do_you_manage_your_machine_learning/" target="_blank" rel="nofollow noopener noreferrer">How do you manage your machine learning experiments?</a> discussion on Reddit is full of insights.<a href="#how-do-you-manage-your-machine-learning-experiments-discussion-on-reddit-is-full-of-insights" aria-label="how do you manage your machine learning experiments discussion on reddit is full of insights permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<blockquote class="reddit-card" data-card-created="1576789144"><a href="https://www.reddit.com/r/MachineLearning/comments/bx0apm/d_how_do_you_manage_your_machine_learning/">[D] How do you manage your machine learning experiments?</a> from <a href="http://www.reddit.com/r/MachineLearning">r/MachineLearning</a></blockquote>
<hr>
<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>There are lots of hidden gems in our Discord community discussions. Sometimes
they are scattered all over the channels and hard to track down.</p>
<p>We sift through the issues and discussions and share the most
interesting takeaways with you.</p>
<h3 id="q-i-have-within-one-git-repository-different-folders-with-very-different-content-basically-different-projects-or-content-i-want-to-have-different-permissions-to-and-i-thought-about-using-different-buckets-in-aws-as-remotes-im-not-sure-if-its-possible-with-dvc-to-store-some-files-in-some-remote-and-some-other-files-in-some-other-remote-is-it" style="position:relative;">Q: I have within one git repository different folders with very different content (basically different projects, or content I want to have different permissions to), and I thought about using different buckets in AWS as remotes. <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/575718048330416158" target="_blank" rel="nofollow noopener noreferrer">I’m not sure if it’s possible with DVC to store some files in some remote, and some other files in some other remote, is it?</a><a href="#q-i-have-within-one-git-repository-different-folders-with-very-different-content-basically-different-projects-or-content-i-want-to-have-different-permissions-to-and-i-thought-about-using-different-buckets-in-aws-as-remotes-im-not-sure-if-its-possible-with-dvc-to-store-some-files-in-some-remote-and-some-other-files-in-some-other-remote-is-it" aria-label="q i have within one git repository different folders with very different content basically different projects or content i want to have different permissions to and i thought about using different buckets in aws as remotes im not sure if its possible with dvc to store some files in some remote and some other files in some other remote is it permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 
13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You can definitely add more than one remote (see
<a href="https://dvc.org/doc/commands-reference/remote/add" target="_blank" rel="nofollow noopener noreferrer">dvc remote add</a>), and
<a href="https://dvc.org/doc/commands-reference/push" target="_blank" rel="nofollow noopener noreferrer">dvc push</a> has a <code>-r</code> option to
pick which one to send the cached data files (deps, outs, etc.) to. We would not
recommend doing this, though. It complicates the commands you have to run: you
will need to remember to specify a remote name for every command that deals with
data (<code>push</code>, <code>pull</code>, <code>gc</code>, <code>fetch</code>, <code>status</code>, etc.). Please leave a comment in
the relevant issue <a href="https://github.com/iterative/dvc/issues/2095" target="_blank" rel="nofollow noopener noreferrer">here</a> if this
case is important for you.</p>
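<p>If you do go this route, a minimal sketch might look like the following. The remote names, bucket URLs and DVC-file paths are placeholders chosen for illustration:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"># Register one remote per project (placeholder names and URLs)
$ dvc remote add project-a s3://bucket-a/dvc-storage
$ dvc remote add project-b s3://bucket-b/dvc-storage
# Send each project's cached data to its own remote
$ dvc push -r project-a project_a/data.dvc
$ dvc push -r project-b project_b/data.dvc</code></pre></div>
<p>Every later <code>pull</code>, <code>fetch</code> or <code>status</code> call for that data would need the matching <code>-r</code> value as well, which is the bookkeeping cost mentioned above.</p>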
<h3 id="q-is-that-possible-with-dvc-to-have-multiple-few-metric-files-and-compare-them-all-at-once-for-example-wed-like-to-consider-as-metrics-the-loss-of-a-neural-network-training-process-loss-as-a--m-output-of-a-training-stage-and-also-apart-knowing-the-accuracy-of-the-nn-on-a-test-set-another--m-output-of-eval-stage" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/578532350221352987" target="_blank" rel="nofollow noopener noreferrer">Is that possible with DVC to have multiple (few) metric files and compare them all at once?</a> For example, we’d like to consider as metrics the loss of a neural network training process (loss as a <code>-M</code> output of a training stage), and also apart knowing the accuracy of the NN on a test set (another <code>-M</code> output of eval stage).<a href="#q-is-that-possible-with-dvc-to-have-multiple-few-metric-files-and-compare-them-all-at-once-for-example-wed-like-to-consider-as-metrics-the-loss-of-a-neural-network-training-process-loss-as-a--m-output-of-a-training-stage-and-also-apart-knowing-the-accuracy-of-the-nn-on-a-test-set-another--m-output-of-eval-stage" aria-label="q is that possible with dvc to have multiple few metric files and compare them all at once for example wed like to consider as metrics the loss of a neural network training process loss as a m output of a training stage and also apart knowing the accuracy of the nn on a test set another m output of eval stage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes, it is totally fine to use <code>-M</code> in different stages. <a href="https://dvc.org/doc/command-reference/metrics/show"><code>dvc metrics show</code></a> will
just show both metrics.</p>
<h3 id="q-i-have-a-scenario-where-an-artifacts-data-folder-is-created-by-the-dvc-run-command-via-the--o-flag-i-have-manually-added-another-file-into-or-modified-the-artifacts-folder-but-when-i-do-dvc-push-nothing-happens-is-there-anyway-around-this" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/577362750443880449" target="_blank" rel="nofollow noopener noreferrer">I have a scenario where an artifacts (data) folder is created by the dvc run command via the <code>-o</code> flag. I have manually added another file into or modified the artifacts folder but when I do <code>dvc push</code> nothing happens, is there anyway around this?</a><a href="#q-i-have-a-scenario-where-an-artifacts-data-folder-is-created-by-the-dvc-run-command-via-the--o-flag-i-have-manually-added-another-file-into-or-modified-the-artifacts-folder-but-when-i-do-dvc-push-nothing-happens-is-there-anyway-around-this" aria-label="q i have a scenario where an artifacts data folder is created by the dvc run command via the o flag i have manually added another file into or modified the artifacts folder but when i do dvc push nothing happens is there anyway around this permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Let’s first do a quick recap on how DVC handles data files (you can definitely
find more information on the <a href="http://dvc.org/docs" target="_blank" rel="nofollow noopener noreferrer">DVC documentation site</a>).</p>
<ul>
<li>
<p>When you do <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>, <code>dvc run</code> or <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> DVC puts artifacts (in case
of <code>dvc run</code> artifacts == outputs produced by the command) into <code>.dvc/cache</code>
directory (default cache location). You don’t see this happening because
<a href="https://dvc.org/doc/user-guide/large-dataset-optimization" target="_blank" rel="nofollow noopener noreferrer">DVC keeps links</a>
(or in certain cases creates a copy) to these files/directories.</p>
</li>
<li>
<p><a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> does not move files from the workspace (that is, what you see) to the
remote storage; it always moves files/directories that are already in the cache
(default is <code>.dvc/cache</code>).</p>
</li>
<li>
<p>So, now you’ve added a file manually, or made some other modifications. But
these files are not in the cache yet. The analogy would be <code>git commit</code>. You
change the file, you do <code>git commit</code>, and only after that can you push something
to the Git server (GitHub, GitLab, etc.). The difference is that DVC does the commit
(moves files to the cache) automatically in certain cases: <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>, <code>dvc run</code>,
etc.</p>
</li>
</ul>
<p>There is an explicit command, <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a>, that you should run if you want to
enforce the change to the output produced by <code>dvc run</code>. This command will update
the corresponding DVC-files (<code>.dvc</code> extension) and will move data to the cache. After
that you should be able to run <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> to save your data on the external
storage.</p>
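<p>A sketch of that sequence, assuming the stage was saved as a hypothetical <code>artifacts.dvc</code> file:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"># Record the manually changed output and move it to the cache
$ dvc commit artifacts.dvc
# Upload the cached data to the remote storage
$ dvc push</code></pre></div>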
<p>Note that when you do an explicit commit like this you are potentially “breaking”
reproducibility, in the sense that there is no longer a guarantee that your
directory can be produced by <code>dvc run</code>/<a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>, since you changed it
manually.</p>
<h3 id="q-id-like-to-transform-my-dataset-in-place-to-avoid-copying-it-but-i-cant-use-dvc-run-to-do-this-because-it-doesnt-allow-the-same-directory-as-an-output-and-a-dependency" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/578898899469729796" target="_blank" rel="nofollow noopener noreferrer">I’d like to transform my dataset in-place to avoid copying it, but I can’t use <code>dvc run</code> to do this because it doesn’t allow the same directory as an output and a dependency.</a><a href="#q-id-like-to-transform-my-dataset-in-place-to-avoid-copying-it-but-i-cant-use-dvc-run-to-do-this-because-it-doesnt-allow-the-same-directory-as-an-output-and-a-dependency" aria-label="q id like to transform my dataset in place to avoid copying it but i cant use dvc run to do this because it doesnt allow the same directory as an output and a dependency permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You could do this in one step (one stage), so that getting your data and
modifying it happen in the same stage. That way the stage does not depend on the data folder;
it depends only on your download-and-modify script.</p>
<h3 id="q-can-anyone-tell-me-what-this-error-message-is-about-to-avoid-unpredictable-behavior-rerun-command-with-non-overlapping-outs-paths" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/579283950778712076" target="_blank" rel="nofollow noopener noreferrer">Can anyone tell me what this error message is about?</a> “To avoid unpredictable behavior, rerun command with non overlapping outs paths.”<a href="#q-can-anyone-tell-me-what-this-error-message-is-about-to-avoid-unpredictable-behavior-rerun-command-with-non-overlapping-outs-paths" aria-label="q can anyone tell me what this error message is about to avoid unpredictable behavior rerun command with non overlapping outs paths permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Most likely it means that there is a DVC-file that has the same output twice,
or that there are two DVC-files that share the same output file.</p>
<h3 id="q-im-getting-no-such-file-or-directory-error-when-i-do-dvc-run-or-dvc-repro-the-command-runs-find-if-i-dont-use-dvc" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/580176327701823498" target="_blank" rel="nofollow noopener noreferrer">I’m getting “No such file or directory” error when I do <code>dvc run</code> or <code>dvc repro</code></a>. The command runs fine if I don’t use DVC.<a href="#q-im-getting-no-such-file-or-directory-error-when-i-do-dvc-run-or-dvc-repro-the-command-runs-find-if-i-dont-use-dvc" aria-label="q im getting no such file or directory error when i do dvc run or dvc repro the command runs find if i dont use dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>That happens because <code>dvc run</code> tries to ensure that your command is the one
creating your output, so it removes existing outputs before executing the command.
That way, when you run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> later, it will be able to fully reproduce the
output. So you need to make the script create the directory or file.</p>
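<p>For example, a hypothetical stage script could create its own output directory before writing to it. The directory name, file name and file content below are made up for illustration:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"># Create the output directory ourselves, since it may have been
# removed before the command started
mkdir -p output
# Write a placeholder result into the freshly created directory
echo "placeholder result" > output/result.txt</code></pre></div>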
<h3 id="q-im-implementing-a-cicd-and-i-would-like-to-simplify-my-cicd-or-even-my-training-code-keeping-them-cloud-agnostic-by-using-dvc-pull-inside-my-docker-container-when-initializing-a-training-job--can-dvc-be-used-in-this-way" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/581256265234251776" target="_blank" rel="nofollow noopener noreferrer">I’m implementing a CI/CD and I would like to simplify my CI/CD or even my training code (keeping them cloud agnostic) by using <code>dvc pull</code> inside my Docker container when initializing a training job. </a> Can DVC be used in this way?<a href="#q-im-implementing-a-cicd-and-i-would-like-to-simplify-my-cicd-or-even-my-training-code-keeping-them-cloud-agnostic-by-using-dvc-pull-inside-my-docker-container-when-initializing-a-training-job--can-dvc-be-used-in-this-way" aria-label="q im implementing a cicd and i would like to simplify my cicd or even my training code keeping them cloud agnostic by using dvc pull inside my docker container when initializing a training job can dvc be used in this way permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes, it’s definitely a valid use case for DVC. There are different ways of
organizing the storage that training machines use to access data. They range
from the very simple (using a local storage volume and pulling data from remote
storage every time) to using NAS or EFS to store a shared DVC cache.</p>
<h3 id="q-i-was-able-to-follow-the-getting-started-examples-however-now-i-am-trying-to-push-my-data-to-github-i-keep-getting-the-following-error-error-failed-to-push-data-to-the-cloud--upload-is-not-supported-by-https-remote" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/598866528984891403" target="_blank" rel="nofollow noopener noreferrer">I was able to follow the getting started examples, however now I am trying to push my data to Github, I keep getting the following error: “ERROR: failed to push data to the cloud — upload is not supported by https remote”.</a><a href="#q-i-was-able-to-follow-the-getting-started-examples-however-now-i-am-trying-to-push-my-data-to-github-i-keep-getting-the-following-error-error-failed-to-push-data-to-the-cloud--upload-is-not-supported-by-https-remote" aria-label="q i was able to follow the getting started examples however now i am trying to push my data to github i keep getting the following error error failed to push data to the cloud upload is not supported by https remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>HTTP remotes do not support upload yet. The example Get Started repository uses
HTTP to keep the remote read-only and to abstract away the actual storage provider we use
internally. If you check the remote URL, you will see that it is an
S3 bucket; AWS provides an HTTP endpoint to read data from it.</p>
<h3 id="q-im-looking-to-configure-aws-s3-as-a-storage-for-dvc-ive-set-up-the-remotes-and-initialized-dvc-in-the-git-repository-i-tried-testing-it-by-pushing-a-dataset-in-the-form-of-an-excel-file-the-command-completed-without-any-issues-but-this-is-what-im-seeing-in-s3-dvc-seems-to-have-created-a-subdirectory-in-the-intended-directory-called-35-where-it-placed-this-file-with-a-strange-name" style="position:relative;">Q: I’m looking to configure AWS S3 as a storage for DVC. I’ve set up the remotes and initialized dvc in the git repository. I tried testing it by pushing a dataset in the form of an excel file. The command completed without any issues but this is what I’m seeing in S3. <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/585967551708921856" target="_blank" rel="nofollow noopener noreferrer">DVC seems to have created a subdirectory in the intended directory called “35” where it placed this file with a strange name.</a><a href="#q-im-looking-to-configure-aws-s3-as-a-storage-for-dvc-ive-set-up-the-remotes-and-initialized-dvc-in-the-git-repository-i-tried-testing-it-by-pushing-a-dataset-in-the-form-of-an-excel-file-the-command-completed-without-any-issues-but-this-is-what-im-seeing-in-s3-dvc-seems-to-have-created-a-subdirectory-in-the-intended-directory-called-35-where-it-placed-this-file-with-a-strange-name" aria-label="q im looking to configure aws s3 as a storage for dvc ive set up the remotes and initialized dvc in the git repository i tried testing it by pushing a dataset in the form of an excel file the command completed without any issues but this is what im seeing in s3 dvc seems to have created a subdirectory in the intended directory called 35 where it placed this file with a strange name permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 
8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is not an issue; it is an implementation detail. There is currently no way to
upload files under their original filenames (in this case, the S3 bucket will
have the file <code>data.csv</code> but under another name, derived from its checksum: <code>20/893143…</code>). The reason for
this decision is that we want to store a file only once, no matter how many
dataset versions it is used in. It is also a reliable way to uniquely identify
the file: you don’t have to worry that someone created a file with
the same name (path) but different content.</p>
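<p>The naming scheme can be sketched roughly as follows. This is an illustration only: it assumes an MD5 checksum whose first two hex characters become the subdirectory, and the function name <code>content_key</code> is hypothetical rather than part of DVC:</p>

```python
import hashlib


def content_key(content: bytes) -> str:
    """Return a storage key of the form '<first 2 hex chars>/<remaining hex chars>'."""
    digest = hashlib.md5(content).hexdigest()
    # The key depends only on the bytes, never on the original filename.
    return f"{digest[:2]}/{digest[2:]}"


if __name__ == "__main__":
    same_a = content_key(b"col1,col2\n1,2\n")
    same_b = content_key(b"col1,col2\n1,2\n")
    other = content_key(b"col1,col2\n3,4\n")
    print(same_a == same_b)  # identical content maps to one key
    print(same_a == other)   # different content maps to a different key
```

<p>Because the key is computed from the content alone, two files with the same name but different content can never collide, and identical content is stored once.</p>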
<h3 id="q-is-it-possible-to-only-have-a-shared-local-cache-and-no-remote-im-trying-to-figure-out-how-to-use-this-in-a-40-node-cluster-which-already-has-very-fast-nfs-storage-across-all-the-nodes-not-storing-everything-twice-seems-desirable-esp-for-the-multi-tb-input-data" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/587730054893666326" target="_blank" rel="nofollow noopener noreferrer">Is it possible to only have a shared ‘local’ cache and no remote?</a> I’m trying to figure out how to use this in a 40 node cluster which already has very fast NFS storage across all the nodes. Not storing everything twice seems desirable. Esp. for the multi-TB input data<a href="#q-is-it-possible-to-only-have-a-shared-local-cache-and-no-remote-im-trying-to-figure-out-how-to-use-this-in-a-40-node-cluster-which-already-has-very-fast-nfs-storage-across-all-the-nodes-not-storing-everything-twice-seems-desirable-esp-for-the-multi-tb-input-data" aria-label="q is it possible to only have a shared local cache and no remote im trying to figure out how to use this in a 40 node cluster which already has very fast nfs storage across all the nodes not storing everything twice seems desirable esp for the multi tb input data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes, and it’s actually a very common use case. All you need to do is
use the <code>dvc cache dir</code> command to set up an external cache. There are a few caveats,
though. Please read
<a href="https://discuss.dvc.org/t/share-nas-data-in-server/180/4?u=shcheklein" target="_blank" rel="nofollow noopener noreferrer">this link</a>
for an example of the workflow.</p>
<hr>
<p>If you have any questions, concerns, or ideas, let us know in the comments below
or connect with the DVC team <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a>. Our
<a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">DMs on Twitter</a> are always open, too.</p>
https://dvc.org/blog/june-19-dvc-heartbeat
Wed, 26 Jun 2019 00:00:00 GMT
<h2 id="news-and-links" style="position:relative;">News and links<a href="#news-and-links" aria-label="news and links permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We want to start by saying to our users, contributors, and community members how
grateful we are for the fantastic work you are doing contributing to DVC, giving
talks about DVC, sharing your feedback, use cases and your concerns. A huge
thank you to each of you from the DVC team!</p>
<p>We would love to give back and support any positive initiative around DVC — just
let us know <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a> and we will send you a bunch of cool
swag, connect you with a tech expert, or find another way to support your project. Our
<a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">DMs on Twitter</a> are open, too.</p>
<p><strong>And if you have 4 minutes to spare, we are conducting our first
<a href="https://docs.google.com/forms/d/1tmn8YHLUkeSi5AIq4DGJi28iZy9HTazl6DWKe3Hxpnc/edit?ts=5cfc47c2" target="_blank" rel="nofollow noopener noreferrer">DVC user survey</a>
and would love to hear from you!</strong></p>
<p>Aside from admiring great DVC-related content from our users, we have one more
reason to particularly enjoy the past month: the DVC team went to Cleveland to
attend <a href="https://us.pycon.org/2019/about/" target="_blank" rel="nofollow noopener noreferrer">PyCon 2019</a>, and it was a blast!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b123f78f23b67bb29be863d7452154a3/03346/cleveland-to-attend-pycon-2019.jpg" alt="cleveland to attend pycon 2019" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Amazing
<a href="https://github.com/sureL" target="_blank" rel="nofollow noopener noreferrer">Jennifer</a> and her artwork for our
<a href="https://twitter.com/hashtag/SupportOpenSource" target="_blank" rel="nofollow noopener noreferrer">SupportOpenSource</a> contest</em></p>
<p>We had it all. Running our first ever conference booth, leading an impromptu
unconference discussion, and arranging some cool
<a href="https://twitter.com/hashtag/SupportOpenSource?src=hashtag_click" target="_blank" rel="nofollow noopener noreferrer">#SupportOpenSource</a>
activities were great! Last-minute accommodation cancellations, booth equipment
delivery issues, and being late for our very own talk were not so great. We will
share more about it in a separate blog post soon.</p>
<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/jkfh2PM5Sz8?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>
<p>Here is <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a>’s PyCon
<a href="https://www.youtube.com/watch?v=jkfh2PM5Sz8" target="_blank" rel="nofollow noopener noreferrer">talk</a> and
<a href="https://docs.google.com/presentation/d/1CYt0w8WoZAXiQEtVDVDsTnQumzdZx91v32MwEK20R-E/edit" target="_blank" rel="nofollow noopener noreferrer">slides</a>
on Machine learning model and dataset versioning practices.</p>
<p>We absolutely loved being at PyCon and can’t wait for our next conference!</p>
<hr>
<p>Our team is so happy every time we discover an article featuring DVC or
addressing one of the burning ML issues we are trying to solve. Here are some of
the links that caught our eye this past month:</p>
<ul>
<li><strong><a href="https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4" target="_blank" rel="nofollow noopener noreferrer">The Rise of DataOps (from the ashes of Data Governance)</a>
by <a href="https://towardsdatascience.com/@ryanwgross" target="_blank" rel="nofollow noopener noreferrer">Ryan Gross</a>.</strong></li>
</ul>
<p>A brilliant, comprehensive read on current data management issues. It might
be the best article we have ever read on this subject. Every word strongly
resonates with our vision and the ideas behind DVC. Highly recommended by the DVC team!</p>
<section class="elp-content-holder">
<a href="https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">The Rise of DataOps (from the ashes of Data Governance)</h4>
<div class="elp-description">Legacy Data Governance is broken in the ML era. Let’s rebuild it as an engineering discipline to drive…</div>
<div class="elp-link">towardsdatascience.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-06-26/the-rise-of-data-ops-1966d840bf0394acafc57223d40c26d2.png" alt="The Rise of DataOps (from the ashes of Data Governance)">
</div>
</a>
</section>
<blockquote>
<p>Legacy Data Governance is broken in the ML era. Let’s rebuild it as an
engineering discipline. At the end of the transformation, data governance will
look a lot more like DevOps, with data stewards, scientists, and engineers
working closely together to codify the governance policies.</p>
</blockquote>
<ul>
<li><strong><a href="https://medium.com/@christopher.samiullah/first-impressions-of-data-science-version-control-dvc-fe96ab29cdda" target="_blank" rel="nofollow noopener noreferrer">First Impressions of Data Science Version Control (DVC)</a>
by <a href="https://christophergs.github.io/" target="_blank" rel="nofollow noopener noreferrer">Christopher Samiullah</a></strong></li>
</ul>
<section class="elp-content-holder">
<a href="https://medium.com/@christopher.samiullah/first-impressions-of-data-science-version-control-dvc-fe96ab29cdda" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">First Impressions of Data Science Version Control (DVC)</h4>
<div class="elp-description">A Powerful New Machine Learning Tool</div>
<div class="elp-link">medium.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-06-26/first-impressions-of-data-science-version-control-e96f9af1cebb895e023e79e4de6eb0f3.png" alt="First Impressions of Data Science Version Control (DVC)">
</div>
</a>
</section>
<blockquote>
<p>In 2019, we tend to find organizations using a mix of git, Makefiles, ad hoc
scripts and reference files to try and achieve reproducibility. DVC enters
this mix offering a cleaner solution, specifically targeting Data Science
challenges.</p>
</blockquote>
<ul>
<li><strong><a href="https://github.com/peopledoc/mlvtools-tutorial" target="_blank" rel="nofollow noopener noreferrer">Versioning and Reproducibility with MLV-tools and DVC</a>:
<a href="https://peopledoc.github.io/mlvtools-tutorial/talks/pyData/presentation.html#/" target="_blank" rel="nofollow noopener noreferrer">Talk</a>
and
<a href="https://peopledoc.github.io/mlvtools-tutorial/talks/workshop/presentation.html#/" target="_blank" rel="nofollow noopener noreferrer">Tutorial</a>
by <a href="https://github.com/sbracaloni" target="_blank" rel="nofollow noopener noreferrer">Stéphanie Bracaloni</a> and
<a href="https://github.com/SdgJlbl" target="_blank" rel="nofollow noopener noreferrer">Sarah Diot-Girard</a>.</strong></li>
</ul>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/72397df92519affe8d30d67d72539d3f/39600/versioning-and-reproducibility-with-mlv-tools.png" alt="versioning and reproducibility with mlv tools" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<ul>
<li><strong><a href="https://www.oreilly.com/ideas/becoming-a-machine-learning-company-means-investing-in-foundational-technologies" target="_blank" rel="nofollow noopener noreferrer">Becoming a machine learning company means investing in foundational technologies</a>
by <a href="https://www.oreilly.com/people/4e7ad-ben-lorica" target="_blank" rel="nofollow noopener noreferrer">Ben Lorica</a></strong></li>
</ul>
<section class="elp-content-holder">
<a href="https://www.oreilly.com/ideas/becoming-a-machine-learning-company-means-investing-in-foundational-technologies" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Becoming a machine learning company means investing in foundational technologies</h4>
<div class="elp-description">Get expert knowledge on the tools and technologies you need to put your data strategies to work. Join us at the…</div>
<div class="elp-link">oreilly.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-06-26/becoming-a-machine-learning-company-307760aa3e556f62ddc35f90eec73eed.jpeg" alt="Becoming a machine learning company means investing in foundational technologies">
</div>
</a>
</section>
<blockquote>
<p>With an eye toward the growing importance of machine learning, we recently
completed
<a href="https://www.oreilly.com/data/free/evolving-data-infrastructure.csp" target="_blank" rel="nofollow noopener noreferrer">a data infrastructure survey</a>
that drew more than 3,200 respondents.</p>
</blockquote>
<hr>
<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>There are lots of hidden gems in our Discord community discussions. Sometimes
they are scattered all over the channels and hard to track down.</p>
<p>We are sifting through the issues and discussions and sharing the most
interesting takeaways with you.</p>
<h3 id="q-does-dvc-support-azure-data-lake-gen1" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/575655655629651968" target="_blank" rel="nofollow noopener noreferrer">Does DVC support Azure Data Lake Gen1?</a><a href="#q-does-dvc-support-azure-data-lake-gen1" aria-label="q does dvc support azure data lake gen1 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Azure Data Lake is HDFS-compatible, and DVC supports HDFS remotes. Give it a try
and let us know if you hit any problems <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<h3 id="q-an-excellent-discussion-on-versioning-tabular-sql-data-do-you-know-of-any-tools-that-deal-better-with-sql-specific-versioning" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/575681811401801748" target="_blank" rel="nofollow noopener noreferrer">An excellent discussion on versioning tabular (SQL) data.</a> Do you know of any tools that deal better with SQL-specific versioning?<a href="#q-an-excellent-discussion-on-versioning-tabular-sql-data-do-you-know-of-any-tools-that-deal-better-with-sql-specific-versioning" aria-label="q an excellent discussion on versioning tabular sql data do you know of any tools that deal better with sql specific versioning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>It’s a broad topic. The actual solution depends on the specific scenario and
on what exactly needs to be versioned. DVC does not provide any special
functionality on top of databases to version their content.</p>
<p>Depending on your use case, our recommendation would be to run the SQL query and save the
result as a file (CSV or TSV, for example) that can then be used for analysis. This file can
be put under DVC control. Alternatively, in certain cases the source files that
are used to populate the database can be put under DVC control, so that we can keep
versions of them or track incoming updates.</p>
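<p>The first approach could look roughly like the sketch below, which dumps a query result to a CSV file that can then be tracked like any other data file. It uses an in-memory SQLite database purely for illustration; the table <code>users</code>, the file name <code>users.csv</code>, and the helper <code>export_query_to_csv</code> are all hypothetical:</p>

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def export_query_to_csv(conn: sqlite3.Connection, query: str, out_path: Path) -> Path:
    """Run `query` and write the header plus all result rows to `out_path` as CSV."""
    cursor = conn.execute(query)
    header = [column[0] for column in cursor.description]
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(cursor)
    return out_path


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])
    out = Path(tempfile.mkdtemp()) / "users.csv"
    # An explicit ORDER BY keeps the snapshot stable between runs.
    export_query_to_csv(conn, "SELECT id, name FROM users ORDER BY id", out)
    print(out.read_text())
```

<p>Sorting the result explicitly matters here: if the row order changes between runs, the file content changes too, and it would look like a new version even when the data is the same.</p>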
<p>Read the
<a href="https://discordapp.com/channels/485586884165107732/563406153334128681/575681811401801748" target="_blank" rel="nofollow noopener noreferrer">discussion</a>
to learn more.</p>
<h3 id="q-how-does-dvc-do-the-versioning-between-binary-files-is-there-a-binary-diff-similar-to-git-or-is-every-version-stored-distinctly-in-full" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/575686711821205504" target="_blank" rel="nofollow noopener noreferrer">How does DVC do the versioning between binary files?</a> Is there a binary diff, similar to git? Or is every version stored distinctly in full?<a href="#q-how-does-dvc-do-the-versioning-between-binary-files-is-there-a-binary-diff-similar-to-git-or-is-every-version-stored-distinctly-in-full" aria-label="q how does dvc do the versioning between binary files is there a binary diff similar to git or is every version stored distinctly in full permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC saves every file as is; we don’t use binary diffs right now. However, adding
just a few files to a directory with 10M files will not duplicate the whole
directory, since we treat every file inside as a separate
entity.</p>
<h3 id="q-is-there-a-way-to-pass-parameters-from-eg-dvc-repro-to-stages" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/576160840701575169" target="_blank" rel="nofollow noopener noreferrer">Is there a way to pass parameters from e.g. <code>dvc repro</code> to stages?</a><a href="#q-is-there-a-way-to-pass-parameters-from-eg-dvc-repro-to-stages" aria-label="q is there a way to pass parameters from eg dvc repro to stages permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The simplest option is to create a config file (JSON or similar) that your
scripts read and that your stages depend on.</p>
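<p>A minimal sketch of the script side, assuming a hypothetical <code>params.json</code> file with hypothetical keys <code>learning_rate</code> and <code>epochs</code>; the stage would then list that file among its dependencies so that changing a value triggers a re-run:</p>

```python
import json
import tempfile
from pathlib import Path

# Hypothetical defaults, used for any key the config file does not set.
DEFAULTS = {"learning_rate": 0.01, "epochs": 10}


def load_params(path: Path) -> dict:
    """Read parameters from a JSON file and overlay them on the defaults."""
    params = dict(DEFAULTS)
    params.update(json.loads(Path(path).read_text()))
    return params


if __name__ == "__main__":
    config = Path(tempfile.mkdtemp()) / "params.json"
    config.write_text(json.dumps({"epochs": 3}))
    print(load_params(config))
```

<p>Keeping the parameters in a file rather than on the command line means the values are versioned together with the code that uses them.</p>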
<h3 id="q-what-is-the-best-way-to-get-cached-output-files-from-different-branches-simultaneously-for-example-cached-tensorboard-files-from-different-branches-to-compare-experiments" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/577852740034625576" target="_blank" rel="nofollow noopener noreferrer">What is the best way to get cached output files from different branches simultaneously?</a> For example, cached tensorboard files from different branches to compare experiments.<a href="#q-what-is-the-best-way-to-get-cached-output-files-from-different-branches-simultaneously-for-example-cached-tensorboard-files-from-different-branches-to-compare-experiments" aria-label="q what is the best way to get cached output files from different branches simultaneously for example cached tensorboard files from different branches to compare experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There is a way to do that pretty easily through our API, which is still not
officially released. Here is an
<a href="https://cdn.discordapp.com/attachments/563406153334128681/577894682722304030/dvc_get_output_files.py" target="_blank" rel="nofollow noopener noreferrer">example script</a>
showing how it could be done.</p>
<h3 id="q-docker-and-dvc-to-being-able-to-pushpull-data-we-need-to-run-a-git-clone-to-get-dvc-files-and-remote-definitions--but-we-worry-that-would-make-the-container-quite-heavy-since-it-contains-our-entire-project-history" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/583949033685516299" target="_blank" rel="nofollow noopener noreferrer">Docker and DVC.</a> To be able to push/pull data we need to run a git clone to get DVC-files and remote definitions, but we worry that would make the container quite heavy (since it contains our entire project history).<a href="#q-docker-and-dvc-to-being-able-to-pushpull-data-we-need-to-run-a-git-clone-to-get-dvc-files-and-remote-definitions--but-we-worry-that-would-make-the-container-quite-heavy-since-it-contains-our-entire-project-history" aria-label="q docker and dvc to being able to pushpull data we need to run a git clone to get dvc files and remote definitions but we worry that would make the container quite heavy since it contains our entire project history permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>You can do <code>git clone --depth 1</code>, which downloads only the latest commit
instead of the entire project history.</p>
<h3 id="q-after-dvc-pushing-the-same-file-it-creates-multiple-copies-of-the-same-file-is-that-how-its-supposed-to-work" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/574133734136086559" target="_blank" rel="nofollow noopener noreferrer">After DVC pushing the same file, it creates multiple copies of the same file. Is that how it’s supposed to work?</a><a href="#q-after-dvc-pushing-the-same-file-it-creates-multiple-copies-of-the-same-file-is-that-how-its-supposed-to-work" aria-label="q after dvc pushing the same file it creates multiple copies of the same file is that how its supposed to work permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If you are pushing the same file, no copies are pushed or saved in the
cache. DVC uses checksums to identify files, so if you add the same file
again, it will detect that the file is already in the local cache and
won’t copy it there again. The same goes for <code>dvc push</code>: if it sees that the remote already
has a cache file with that checksum, it won’t upload it again.</p>
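<p>The behaviour described above can be modelled with a toy example. The sketch below is not DVC code: it uses plain dictionaries keyed by an MD5 checksum to stand in for the local cache and the remote, and the function names <code>add</code> and <code>push</code> are chosen only to mirror the commands:</p>

```python
import hashlib


def checksum(content: bytes) -> str:
    """Identify content by its MD5 hex digest."""
    return hashlib.md5(content).hexdigest()


def add(content: bytes, cache: dict) -> tuple:
    """Store content in the cache unless the same checksum is already there."""
    key = checksum(content)
    is_new = key not in cache
    if is_new:
        cache[key] = content
    return key, is_new


def push(cache: dict, remote: dict) -> list:
    """Upload only the objects whose checksum is missing on the remote."""
    uploaded = [key for key in cache if key not in remote]
    for key in uploaded:
        remote[key] = cache[key]
    return uploaded


if __name__ == "__main__":
    cache, remote = {}, {}
    print(add(b"same bytes", cache)[1])  # first add stores the content
    print(add(b"same bytes", cache)[1])  # second add finds it already cached
    print(len(push(cache, remote)))      # first push uploads one object
    print(len(push(cache, remote)))      # second push has nothing to upload
```

<p>Since both steps compare checksums rather than filenames, adding or pushing the same content repeatedly does no extra work.</p>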
<h3 id="q-how-do-i-uninstall-dvc-on-mac-installed-via-pkg-installer" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/574941227624169492" target="_blank" rel="nofollow noopener noreferrer">How do I uninstall DVC on Mac (installed via <code>pkg</code> installer)?</a><a href="#q-how-do-i-uninstall-dvc-on-mac-installed-via-pkg-installer" aria-label="q how do i uninstall dvc on mac installed via pkg installer permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Something like this should work:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">which</span> dvc
</span>/usr/local/bin/dvc
<span class="token line"><span class="token input">$ </span><span class="token command">ls</span> <span class="token parameter variable">-la</span> /usr/local/bin/dvc
</span>/usr/local/bin/dvc -> /usr/local/lib/dvc/dvc
<span class="token line"><span class="token input">$ </span><span class="token command">sudo</span> <span class="token function">rm</span> <span class="token parameter variable">-f</span> /usr/local/bin/dvc
</span><span class="token line"><span class="token input">$ </span><span class="token command">sudo</span> <span class="token function">rm</span> <span class="token parameter variable">-rf</span> /usr/local/lib/dvc
</span><span class="token line"><span class="token input">$ </span><span class="token command">sudo</span> pkgutil <span class="token parameter variable">--forget</span> com.iterative.dvc</span></code></pre></div>
<h3 id="q-how-do-i-pull-from-a-public-s3-bucket-that-contains-dvc-remote" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/575236576309674024" target="_blank" rel="nofollow noopener noreferrer">How do I pull from a public S3 bucket (that contains DVC remote)?</a><a href="#q-how-do-i-pull-from-a-public-s3-bucket-that-contains-dvc-remote" aria-label="q how do i pull from a public s3 bucket that contains dvc remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Just add the public URL of the bucket as an HTTP endpoint. See
<a href="https://github.com/iterative/example-get-started/blob/master/.dvc/config" target="_blank" rel="nofollow noopener noreferrer">here</a>
for an example.
<a href="https://remote.dvc.org/get-started" target="_blank" rel="nofollow noopener noreferrer">https://remote.dvc.org/get-started</a> is set up
to redirect to an S3 bucket that anyone can read from.</p>
<h3 id="q-im-getting-the-same-error-over-and-over-about-locking-error-failed-to-lock-before-running-a-command--cannot-perform-the-cmd-since-dvc-is-busy-and-locked-please-retry-the-command-later" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/575535709490905101" target="_blank" rel="nofollow noopener noreferrer">I’m getting the same error over and over about locking:</a> <code>ERROR: failed to lock before running a command — cannot perform the cmd since DVC is busy and locked. Please retry the command later.</code><a href="#q-im-getting-the-same-error-over-and-over-about-locking-error-failed-to-lock-before-running-a-command--cannot-perform-the-cmd-since-dvc-is-busy-and-locked-please-retry-the-command-later" aria-label="q im getting the same error over and over about locking error failed to lock before running a command cannot perform the cmd since dvc is busy and locked please retry the command later permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Most likely this happens because DVC is running on an NFS mount that has some
configuration problems. There is a
<a href="https://github.com/iterative/dvc/issues/1918" target="_blank" rel="nofollow noopener noreferrer">well-known problem with DVC on NFS</a>:
sometimes it hangs while trying to lock a file. The usual workaround is to
keep the DVC cache on NFS but run the project itself (the Git clone, DVC
metafiles, etc.) on the local file system. Read
<a href="https://discuss.dvc.org/t/share-nas-data-in-server/180/4?u=shcheklein" target="_blank" rel="nofollow noopener noreferrer">this answer</a>
to see how it can be set up.</p>
<hr>
<p>If you have any questions, concerns, or ideas, let us know in the comments below
or connect with the DVC team <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a>. Our
<a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">DMs on Twitter</a> are open, too.</p>https://dvc.org/blog/may-19-dvc-heartbeathttps://dvc.org/blog/may-19-dvc-heartbeatTue, 21 May 2019 00:00:00 GMT<h2 id="news-and-links" style="position:relative;">News and links<a href="#news-and-links" aria-label="news and links permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>This section of DVC Heartbeat is growing with every new issue, and that is
already quite a good piece of news!</p>
<p>One of the most exciting things we want to share this month is acceptance of DVC
into the <a href="https://developers.google.com/season-of-docs/" target="_blank" rel="nofollow noopener noreferrer">Google Season of Docs</a>.
It is a new and unique program sponsored by Google that pairs technical writers
with open source projects to collaborate and improve the open source project
documentation. You can find an outline of the DVC vision and project ideas in
<a href="https://blog.dataversioncontrol.com/dvc-project-ideas-for-google-summer-of-docs-2019-defe3a73b248" target="_blank" rel="nofollow noopener noreferrer">this dedicated blogpost</a>
and check the
<a href="https://developers.google.com/season-of-docs/docs/participants/" target="_blank" rel="nofollow noopener noreferrer">full list of participating open source organizations</a>.
Technically the
<a href="https://developers.google.com/season-of-docs/docs/timeline" target="_blank" rel="nofollow noopener noreferrer">program is starting in a few months</a>,
but there is already a fantastic increase in the number of commits and
contributors, and we absolutely love it!</p>
<p>The other important milestone for us was the first offline meeting with our
distributed remote team. Working side by side and having non-Zoom meetings with
the team was amazing. Joining forces to prepare for the upcoming conferences
turned out to be the most valuable, educational, and uniting experience for the
whole team.</p>
<p>It’s a shame that our tech lead was unable to join us due to another visa
denial. We do hope he will finally make it to the USA for the next big
conference.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/060f8f204b833689b1569a4162d67e3d/39600/the-world-is-changing.png" alt="the world is changing" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>While we were busy finalizing all the PyCon 2019 prep, our own
<a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> flew to New York to speak at
the
<a href="https://conferences.oreilly.com/artificial-intelligence/ai-ny" target="_blank" rel="nofollow noopener noreferrer">O’Reilly AI Conference</a>
about the
<a href="https://www.oreilly.com/library/view/artificial-intelligence-conference/9781492050544/video324691.html" target="_blank" rel="nofollow noopener noreferrer">Open Source tools for Machine Learning Models and Datasets versioning</a>.
Unfortunately, the video is available to registered users only (with a free
trial option), but you can have a look at Dmitry’s slides
<a href="https://www.slideshare.net/DmitryPetrov15/dvc-oreilly-artificial-intelligence-conference-2019-new-york" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 404px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bee9b4ed9981db1bf7eb9db8450fc8d1/39600/iterative-ai-twitter.png" alt="iterative ai twitter" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>We renamed our Twitter! Our old handle was a bit misleading, so we moved from
@Iterativeai to <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">@DVCorg</a> (but we are keeping the old one for
future projects).</p>
<p>Our team is so happy every time we discover an article featuring DVC or
addressing one of the burning ML issues we are trying to solve. Here are some of
our favorite links from the past month:</p>
<ul>
<li><strong><a href="https://www.pythonpodcast.com/data-version-control-episode-206/" target="_blank" rel="nofollow noopener noreferrer">Version Control For Your Machine Learning Projects — Episode 206</a></strong>
by <strong><a href="https://www.linkedin.com/in/tmacey/" target="_blank" rel="nofollow noopener noreferrer">Tobias Macey</a></strong></li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://www.pythonpodcast.com/data-version-control-episode-206/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Version Control For Machine Learning Projects</h4>
<div class="elp-description">An interview with the creator of DVC about how it improves collaboration and reduces duplicate effort on data science…</div>
<div class="elp-link">pythonpodcast.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-05-21/version-control-for-your-machine-learning-projects-d29d6b83905b901e7573d865b78db914.png" alt="Version Control For Machine Learning Projects">
</div>
</a>
</section>
<p></p>
<blockquote>
<p>Version control has become table stakes for any software team, but for machine
learning projects there has been no good answer for tracking all of the data
that goes into building and training models, and the output of the models
themselves. To address that need Dmitry Petrov built the Data Version Control
project known as DVC. In this episode he explains how it simplifies
communication between data scientists, reduces duplicated effort, and
simplifies concerns around reproducing and rebuilding models at different
stages of the projects lifecycle.</p>
</blockquote>
<ul>
<li><strong>Here is an
<a href="https://towardsdatascience.com/data-version-control-with-dvc-what-do-the-authors-have-to-say-3c3b10f27ee" target="_blank" rel="nofollow noopener noreferrer">article</a>
by <a href="https://medium.com/@faviovazquez" target="_blank" rel="nofollow noopener noreferrer">Favio Vázquez</a> with a transcript of this
podcast episode.</strong></li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://towardsdatascience.com/data-version-control-with-dvc-what-do-the-authors-have-to-say-3c3b10f27ee" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Data version control with DVC. What do the authors have to say?</h4>
<div class="elp-description">Data versioning is one of the most ignored features in data science projects, but that has to change. Here I’ll discuss…</div>
<div class="elp-link">towardsdatascience.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-05-21/data-version-control-with-dvc-e9c8eefcd560f601394a53c3c300bfe5.png" alt="Data version control with DVC. What do the authors have to say?">
</div>
</a>
</section>
<p></p>
<ul>
<li><strong><a href="https://towardsdatascience.com/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-f733b49e96e8" target="_blank" rel="nofollow noopener noreferrer">Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis</a></strong></li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://towardsdatascience.com/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-f733b49e96e8" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis</h4>
<div class="elp-description">Some claim the machine learning field is in a crisis due to software tooling that’s insufficient to ensure repeatable…</div>
<div class="elp-link">towardsdatascience.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-05-21/why-git-and-git-lfs-is-not-enough-eaf6ce46d5fc3cf9d0672d03331d00b1.jpeg" alt="Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis">
</div>
</a>
</section>
<p></p>
<blockquote>
<p>With Git-LFS your team has better control over the data, because it is now
version controlled. Does that mean the problem is solved? Earlier we said the
“<em>key issue is the training data</em>”, but that was a lie. Sort of. Yes keeping
the data under version control is a big improvement. But is the lack of
version control of the data files the entire problem? No.</p>
</blockquote>
<hr>
<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>There are lots of hidden gems in our Discord community discussions. Sometimes
they are scattered all over the channels and hard to track down.</p>
<p>We are sifting through the issues and discussions to share the most
interesting takeaways with you.</p>
<h3 id="q-this-might-be-a-favourite-gem-of-ours---our-engineers-are-so-fast-that-someone-assumed-they-were-bots" style="position:relative;">Q: This might be <a href="https://discordapp.com/channels/485586884165107732/485598848111083531/572960640122224640" target="_blank" rel="nofollow noopener noreferrer">a favourite gem of ours </a> — our engineers are so fast that someone assumed they were bots.<a href="#q-this-might-be-a-favourite-gem-of-ours---our-engineers-are-so-fast-that-someone-assumed-they-were-bots" aria-label="q this might be a favourite gem of ours our engineers are so fast that someone assumed they were bots permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We feared that too until we met them in person. They appeared to be real (unless
bots also love Ramen now)!</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4926411413e184b4531924e6c0aeaf02/39600/bots-also-love-ramen-now.png" alt="bots also love ramen now" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<h3 id="q-is-this-the-best-way-to-track-data-with-dvc-when-code-and-data-are-separate-having-being-burned-by-this-a-couple-of-times-ie-accidentally-pushing-large-files-to-github-i-now-keep-my-code-and-data-separate" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/572974117351849997" target="_blank" rel="nofollow noopener noreferrer">Is this the best way to track data with DVC when code and data are separate?</a> Having been burned by this a couple of times, i.e. accidentally pushing large files to GitHub, I now keep my code and data separate.<a href="#q-is-this-the-best-way-to-track-data-with-dvc-when-code-and-data-are-separate-having-being-burned-by-this-a-couple-of-times-ie-accidentally-pushing-large-files-to-github-i-now-keep-my-code-and-data-separate" aria-label="q is this the best way to track data with dvc when code and data are separate having being burned by this a couple of times ie accidentally pushing large files to github i now keep my code and data separate permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Every time you run <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> to start tracking a data artifact, its path is
automatically added to the <code>.gitignore</code> file. As a result, it is hard to commit
it to Git by mistake: you would need to explicitly modify <code>.gitignore</code>
first. If all you need is to track data artifacts that live outside the repository, the feature for that is called
<a href="https://dvc.org/doc/user-guide/managing-external-data" target="_blank" rel="nofollow noopener noreferrer">external outputs</a>. It is usually used when you have
some data on S3 or SSH and don’t want to pull it into your workspace, but
it also works when your data is located on the same machine outside of the
repository.</p>
<h3 id="q-how-do-i-wrap-a-step-that-downloads-a-filedirectory-into-a-dvc-stage-i-want-to-ensure-that-it-runs-only-if-file-has-no-been-downloaded-yet" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/571342592508428289" target="_blank" rel="nofollow noopener noreferrer">How do I wrap a step that downloads a file/directory into a DVC stage?</a> I want to ensure that it runs only if the file has not been downloaded yet.<a href="#q-how-do-i-wrap-a-step-that-downloads-a-filedirectory-into-a-dvc-stage-i-want-to-ensure-that-it-runs-only-if-file-has-no-been-downloaded-yet" aria-label="q how do i wrap a step that downloads a filedirectory into a dvc stage i want to ensure that it runs only if file has no been downloaded yet permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Use <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> to track and download the remote data. It downloads the data the first time, and
again on a later <code>dvc repro</code> if the data has changed remotely. If you don’t want to track
remote changes (that is, you want to lock the data after it is downloaded), use <code>dvc run</code> with a
dummy dependency (any text file that you do not touch will do) and a command that runs an actual
wget/curl to get the data.</p>
<h3 id="q-how-do-i-show-a-pipeline-that-does-not-have-a-default-dvcfile-eg-i-assigned-all-files-names-manually-with--f-in-the-dvc-run-command-and-i-just-dont-have-dvcfile-anymore" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/570943786151313408" target="_blank" rel="nofollow noopener noreferrer">How do I show a pipeline that does not have a default Dvcfile?</a> (e.g. I assigned all files names manually with <code>-f</code> in the <code>dvc run</code> command and I just don’t have <code>Dvcfile</code> anymore)<a href="#q-how-do-i-show-a-pipeline-that-does-not-have-a-default-dvcfile-eg-i-assigned-all-files-names-manually-with--f-in-the-dvc-run-command-and-i-just-dont-have-dvcfile-anymore" aria-label="q how do i show a pipeline that does not have a default dvcfile eg i assigned all files names manually with f in the dvc run command and i just dont have dvcfile anymore permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Almost any DVC command that deals with pipelines (sets of DVC-files) accepts a
single stage as a target, for example:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">dvc</span> pipeline show <span class="token parameter variable">--ascii</span> model.dvc</span></code></pre></div>
<h3 id="q-dvc-hangs-or-im-getting-database-is-locked-issue" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/570843482218823682" target="_blank" rel="nofollow noopener noreferrer">DVC hangs or I’m getting <code>database is locked</code> issue</a><a href="#q-dvc-hangs-or-im-getting-database-is-locked-issue" aria-label="q dvc hangs or im getting database is locked issue permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>It’s a well-known problem with NFS and CIFS (Azure): they do not properly support the file
locks that the SQLite engine requires to operate. The easiest
workaround is not to create a DVC project on a network-attached partition. In
certain cases the problem can be fixed by changing the mount options; check
<a href="https://discordapp.com/channels/485586884165107732/485596304961962003/570276668694855690" target="_blank" rel="nofollow noopener noreferrer">this discussion</a>
for the Azure ML Service.</p>
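<p>To see where the <code>database is locked</code> message comes from, here is a minimal sketch that reproduces it on a healthy local file system by holding a SQLite write lock in one connection and writing from another. The table name <code>state</code> and its columns are hypothetical and are not DVC’s actual schema; on NFS or CIFS the same error shows up because the file locks themselves misbehave, not because another process really holds them.</p>

```python
import os
import sqlite3
import tempfile


def try_write_while_locked(db_path):
    """Hold a write lock in one connection and try to write from another."""
    # isolation_level=None means autocommit mode, so BEGIN/ROLLBACK are explicit.
    holder = sqlite3.connect(db_path, isolation_level=None)
    holder.execute("CREATE TABLE IF NOT EXISTS state (path TEXT, md5 TEXT)")
    holder.execute("BEGIN EXCLUSIVE")  # takes an exclusive lock on the file

    # timeout=0 makes the second connection fail at once instead of waiting.
    other = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        other.execute("INSERT INTO state VALUES ('data.csv', 'abc')")
        return "write succeeded"
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        other.close()
        holder.execute("ROLLBACK")
        holder.close()


with tempfile.TemporaryDirectory() as tmp:
    message = try_write_while_locked(os.path.join(tmp, "state.db"))
    print(message)
```

<p>Running the same script with the database file placed on the network mount is one quick way to check whether SQLite locking works there at all.</p>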
<h3 id="q-how-do-i-use-dvc-if-i-use-a-separate-drive-to-store-the-data-and-a-smallfast-ssd-to-run-computations-i-dont-have-enough-space-to-bring-data-to-my-working-space" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/570091809594671126" target="_blank" rel="nofollow noopener noreferrer">How do I use DVC if I use a separate drive to store the data and a small/fast SSD to run computations?</a> I don’t have enough space to bring data to my working space.<a href="#q-how-do-i-use-dvc-if-i-use-a-separate-drive-to-store-the-data-and-a-smallfast-ssd-to-run-computations-i-dont-have-enough-space-to-bring-data-to-my-working-space" aria-label="q how do i use dvc if i use a separate drive to store the data and a smallfast ssd to run computations i dont have enough space to bring data to my working space permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>An excellent question! The short answer is:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token comment"># To move your data cache to a big partition</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc cache dir</span> <span class="token parameter variable">--local</span> /path/to/an/external/partition
</span>
<span class="token comment"># To enable symlinks/hardlinks to avoid actual copying</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc config</span> cache.type reflink,hardlink,symlink,copy
</span>
<span class="token comment"># To protect the cache</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc config</span> cache.protected <span class="token boolean">true</span></span></code></pre></div>
<p>The last setting is highly recommended: it makes the links in your workspace read-only
so that you cannot corrupt the cache by editing a linked file. Read more about different link types
<a href="https://dvc.org/doc/user-guide/large-dataset-optimization" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
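<p>As a rough illustration of how these link types differ, the sketch below creates a hard link, a symlink, and a plain copy of a stand-in "cache" file using only the Python standard library. It assumes a POSIX file system, the file names are hypothetical, and reflinks are left out because they need file system support; this is not how DVC itself creates links.</p>

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    cached = os.path.join(tmp, "cache_file")  # stands in for a file in the DVC cache
    with open(cached, "w") as f:
        f.write("payload")

    hard = os.path.join(tmp, "hardlink")
    soft = os.path.join(tmp, "symlink")
    copy = os.path.join(tmp, "copy")

    os.link(cached, hard)          # second name for the same inode, no data copied
    os.symlink(cached, soft)       # small pointer that refers to the cache path
    shutil.copyfile(cached, copy)  # independent duplicate that uses extra space

    same_inode_hard = os.stat(hard).st_ino == os.stat(cached).st_ino
    is_symlink = os.path.islink(soft)
    same_inode_copy = os.stat(copy).st_ino == os.stat(cached).st_ino

    # Writing through the hard link changes the cached file as well,
    # while the copy keeps its original content.
    with open(hard, "a") as f:
        f.write("!")
    with open(cached) as f:
        cached_after_edit = f.read()
    with open(copy) as f:
        copy_after_edit = f.read()

    print(same_inode_hard, is_symlink, same_inode_copy)
    print(cached_after_edit, copy_after_edit)
```

<p>The edit through the hard link is the kind of accidental cache corruption that read-only links are meant to prevent.</p>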
<p>To add your data to the DVC cache for the first time, clone the repository on the
big partition and run <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> there. Then you can run <code>git pull</code> and
<a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> on the small partition, and DVC will create all the necessary links.</p>
<h3 id="q-why-im-getting-paths-for-outs-overlap-error-when-i-run-dvc-add-or-dvc-run" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/571335064374345749" target="_blank" rel="nofollow noopener noreferrer">Why I’m getting <code>Paths for outs overlap</code> error when I run <code>dvc add</code> or <code>dvc run</code>?</a><a href="#q-why-im-getting-paths-for-outs-overlap-error-when-i-run-dvc-add-or-dvc-run" aria-label="q why im getting paths for outs overlap error when i run dvc add or dvc run permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Usually it means that a parent directory of one of the arguments to <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> /
<code>dvc run</code> is already tracked. For example, you’ve already added the whole <code>datasets</code>
directory, and now you are trying to add a subdirectory that is
already tracked as part of it. There is no need to do that. You can run
<a href="https://dvc.org/doc/command-reference/add"><code>dvc add datasets</code></a> or <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro datasets.dvc</code></a> to save the changes.</p>
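<p>The overlap condition can be pictured with a small path check. This is only a sketch of the idea, assuming POSIX-style relative paths; the helper <code>overlaps</code> is hypothetical and is not DVC’s own implementation.</p>

```python
from pathlib import PurePosixPath


def overlaps(tracked, candidate):
    """Return True if one path equals the other or lies inside it."""
    a = PurePosixPath(tracked)
    b = PurePosixPath(candidate)
    return a == b or a in b.parents or b in a.parents


# A subdirectory of an already tracked directory overlaps with it.
print(overlaps("datasets", "datasets/images"))
# Unrelated paths do not overlap.
print(overlaps("datasets", "models/model.pkl"))
```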
<h3 id="q-im-getting-ascii-codec-cant-encode-character-error-on-dvc-commands-when-i-deal-with-unicode-file-names" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/567310354766495747" target="_blank" rel="nofollow noopener noreferrer">I’m getting <code>ascii codec can’t encode character</code> error on DVC commands when I deal with unicode file names</a><a href="#q-im-getting-ascii-codec-cant-encode-character-error-on-dvc-commands-when-i-deal-with-unicode-file-names" aria-label="q im getting ascii codec cant encode character error on dvc commands when i deal with unicode file names permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><a href="https://perlgeek.de/en/article/set-up-a-clean-utf8-environment" target="_blank" rel="nofollow noopener noreferrer">Check the locale settings you have</a>
(<code>locale</code> command in Linux). Python expects a locale that can handle unicode
printing. Usually it’s solved with these commands: <code>export LC_ALL=en_US.UTF-8</code>
and <code>export LANG=en_US.UTF-8</code>. You can place those exports in <code>.bashrc</code> or
another file that defines your environment.</p>
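<p>To check what Python actually sees, a short diagnostic like the one below can help. It prints the encodings Python picked up from the environment and shows why an ASCII locale fails on a non-ASCII name; the sample file name is hypothetical.</p>

```python
import locale
import sys


def can_encode(text, encoding):
    """Return True if `text` can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


sample = "donn\u00e9es.csv"  # hypothetical file name with a non-ASCII character

print("preferred encoding:", locale.getpreferredencoding(False))
print("stdout encoding:", sys.stdout.encoding)
print("ascii can encode sample:", can_encode(sample, "ascii"))
print("utf-8 can encode sample:", can_encode(sample, "utf-8"))
```

<p>If the first two lines report an ASCII-only encoding, the exports above are the likely fix.</p>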
<h3 id="q-does-dvc-use-the-same-logins-aws-cli-has-when-using-an-s3-bucket-as-its-reporemote-storage" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/563149775340568576" target="_blank" rel="nofollow noopener noreferrer">Does DVC use the same logins <code>aws-cli</code> has when using an S3 bucket as its repo/remote storage</a>?<a href="#q-does-dvc-use-the-same-logins-aws-cli-has-when-using-an-s3-bucket-as-its-reporemote-storage" aria-label="q does dvc use the same logins aws cli has when using an s3 bucket as its reporemote storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>In short, yes, but this can also be configured. By default, DVC uses either your
default profile (from <code>~/.aws/*</code>) or your environment variables. If you need more
flexibility (e.g. different credentials for different projects),
check out
<a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html" target="_blank" rel="nofollow noopener noreferrer">this guide</a>
to configure custom AWS profiles, which you can then use with DVC through
these
<a href="https://dvc.org/doc/commands-reference/remote/add#options" target="_blank" rel="nofollow noopener noreferrer">remote options</a>.</p>
<h3 id="q-how-can-i-output-multiple-metrics-from-a-single-file" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/566000729505136661" target="_blank" rel="nofollow noopener noreferrer">How can I output multiple metrics from a single file?</a><a href="#q-how-can-i-output-multiple-metrics-from-a-single-file" aria-label="q how can i output multiple metrics from a single file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Let’s say I have the following in a file:</p>
<div class="gatsby-highlight" data-language="json"><pre class="language-json"><code class="language-json"><span class="token punctuation">{</span>
"AUC_RATIO"<span class="token operator">:</span>
<span class="token punctuation">{</span>
"train"<span class="token operator">:</span> <span class="token number">0.8922748258797667</span><span class="token punctuation">,</span>
"valid"<span class="token operator">:</span> <span class="token number">0.8561602726251776</span><span class="token punctuation">,</span>
"xval"<span class="token operator">:</span> <span class="token number">0.8843431199314923</span>
<span class="token punctuation">}</span>
<span class="token punctuation">}</span></code></pre></div>
<p>How can I show both <code>train</code> and <code>valid</code> without <code>xval</code>?</p>
<p>You can use the <code>--xpath</code> option of the <a href="https://dvc.org/doc/command-reference/metrics/show"><code>dvc metrics show</code></a> command and pass multiple
attribute names to it:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc metrics show</span> metrics.json <span class="token punctuation">\</span>
<span class="token parameter variable">--type</span> json <span class="token punctuation">\</span>
<span class="token parameter variable">--xpath</span> AUC_RATIO<span class="token punctuation">[</span>train,valid<span class="token punctuation">]</span>
</span> metrics.json:
0.89227482588
0.856160272625</code></pre></div>
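<p>For comparison, the same selection can be done in plain Python when you post-process the metrics file yourself. This is a minimal sketch that embeds the JSON from the question as a string instead of reading <code>metrics.json</code> from disk.</p>

```python
import json

metrics_text = """
{
  "AUC_RATIO": {
    "train": 0.8922748258797667,
    "valid": 0.8561602726251776,
    "xval": 0.8843431199314923
  }
}
"""

metrics = json.loads(metrics_text)

# Keep only the two attributes of interest and leave out "xval".
selected = {name: metrics["AUC_RATIO"][name] for name in ("train", "valid")}
print(selected)
```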
<h3 id="q-what-is-the-quickest-way-to-add-a-new-dependency-to-a-dvc-file" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/566314479499870211" target="_blank" rel="nofollow noopener noreferrer">What is the quickest way to add a new dependency to a DVC-file?</a><a href="#q-what-is-the-quickest-way-to-add-a-new-dependency-to-a-dvc-file" aria-label="q what is the quickest way to add a new dependency to a dvc file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There are a few options to add a new dependency:</p>
<ul>
<li>
<p>Open the DVC-file in your favorite editor and add the dependency there
without an md5 field. DVC will detect that the stage has changed, re-run it,
and recalculate the md5 checksums during the next <code>dvc repro</code>.</p>
</li>
<li>
<p>Use <code>dvc run --no-exec</code>. It rewrites the existing file
for you with the new parameters.</p>
</li>
</ul>
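<p>The first option can be sketched in Python as an edit to the parsed contents of a DVC-file. This is an illustration only: it assumes the file parses to a dictionary with a <code>deps</code> list of <code>path</code>/<code>md5</code> entries, and the stage contents, the checksum and <code>add_dependency</code> are hypothetical.</p>

```python
def add_dependency(stage, path):
    """Return a copy of `stage` with `path` listed under deps, without a checksum."""
    deps = [dict(dep) for dep in stage.get("deps", [])]
    if all(dep.get("path") != path for dep in deps):
        # No md5 here: the text above says DVC recalculates it on the next repro.
        deps.append({"path": path})
    updated = dict(stage)
    updated["deps"] = deps
    return updated


# Hypothetical stage, as it might look after parsing a DVC-file.
stage = {
    "cmd": "python train.py",
    "deps": [{"md5": "0123456789abcdef0123456789abcdef", "path": "train.py"}],
    "outs": [{"path": "model.pkl"}],
}
updated = add_dependency(stage, "data/features.csv")
print(updated["deps"])
```

<p>The helper leaves the original dictionary untouched and does nothing if the dependency is already listed.</p>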
<h3 id="q-is-there-a-way-to-add-a-dependency-to-a-python-package-so-it-runs-a-stage-again-if-it-imported-the-updated-library" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/566315265646788628" target="_blank" rel="nofollow noopener noreferrer">Is there a way to add a dependency to a python package, so it runs a stage again if it imported the updated library?</a><a href="#q-is-there-a-way-to-add-a-dependency-to-a-python-package-so-it-runs-a-stage-again-if-it-imported-the-updated-library" aria-label="q is there a way to add a dependency to a python package so it runs a stage again if it imported the updated library permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>The only recommended way so far is to make DVC aware of your
package’s version. One way to do that is to create a separate stage that
writes the version of that specific package into a file that
your stage depends on:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-o</span> mypkgver 'pip show mypkg <span class="token operator">></span> mypkgver'
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-d</span> mypkgver <span class="token parameter variable">-d</span> <span class="token punctuation">..</span>. <span class="token parameter variable">-o</span> <span class="token punctuation">..</span> mycmd</span></code></pre></div>
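<p>If you prefer to produce the <code>mypkgver</code> file from Python rather than from <code>pip show</code>, a minimal sketch could look like the following. It records only the version string instead of the full <code>pip show</code> output; <code>write_package_version</code> and the <code>not-installed</code> marker are assumptions made for illustration.</p>

```python
import tempfile
from importlib import metadata
from pathlib import Path


def write_package_version(package, target):
    """Write the installed version of `package` to `target` and return it."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # Hypothetical marker so the file still changes when the package appears.
        version = "not-installed"
    Path(target).write_text(version + "\n")
    return version


# Demo in a temporary directory; in a real stage the target would be mypkgver.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "mypkgver"
    recorded = write_package_version("mypkg", target)
    contents = target.read_text()
    print(contents.strip())
```

<p>Because the file content changes whenever the installed version changes, any stage that depends on the file is considered changed as well.</p>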
<h3 id="q-is-there-anyway-to-forcibly-recompute-the-hashes-of-dependencies-in-a-pipeline-dvc-file" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/564807276146458624" target="_blank" rel="nofollow noopener noreferrer">Is there anyway to forcibly recompute the hashes of dependencies in a pipeline DVC-file?</a><a href="#q-is-there-anyway-to-forcibly-recompute-the-hashes-of-dependencies-in-a-pipeline-dvc-file" aria-label="q is there anyway to forcibly recompute the hashes of dependencies in a pipeline dvc file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>E.g. I made some whitespace/comment changes in my code and I want to tell DVC
“it’s ok, you don’t have to recompute everything”.</p>
<p>Yes, you can run <a href="https://dvc.org/doc/command-reference/commit#-f"><code>dvc commit -f</code></a>. It saves all current checksums without
re-running your commands.</p>
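<p>To see why a comment-only edit makes a stage look changed in the first place, here is a small sketch. It assumes the checksum of a dependency is a plain MD5 digest of the file content; the two code snippets are made up.</p>

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of `data`."""
    return hashlib.md5(data).hexdigest()


# Same behavior, different bytes: the second version only adds a comment.
original = b"print('hello')\n"
with_comment = b"print('hello')  # greet the user\n"

changed = md5_hex(original) != md5_hex(with_comment)
print(changed)
```

<p>Any byte-level difference produces a different digest, which is why the saved checksums have to be refreshed when you do not want to re-run the commands.</p>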
<h3 id="q-i-have-projects-that-use-data-thats-stored-in-s3-i-never-have-data-locally-to-use-dvc-push-but-i-would-like-to-have-this-data-version-controlled-is-there-a-way-to-use-the-features-of-dvc-in-this-use-case" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/563352000281182218" target="_blank" rel="nofollow noopener noreferrer">I have projects that use data that’s stored in S3. I never have data locally to use <code>dvc push</code>, but I would like to have this data version controlled.</a> Is there a way to use the features of DVC in this use case?<a href="#q-i-have-projects-that-use-data-thats-stored-in-s3-i-never-have-data-locally-to-use-dvc-push-but-i-would-like-to-have-this-data-version-controlled-is-there-a-way-to-use-the-features-of-dvc-in-this-use-case" aria-label="q i have projects that use data thats stored in s3 i never have data locally to use dvc push but i would like to have this data version controlled is there a way to use the features of dvc in this use case permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Yes! These DVC features are called
<a href="https://dvc.org/doc/user-guide/large-dataset-optimization" target="_blank" rel="nofollow noopener noreferrer">external outputs</a>
and
<a href="https://dvc.org/doc/user-guide/external-dependencies" target="_blank" rel="nofollow noopener noreferrer">external dependencies</a>.
You can use one of them or both to track, process, and version your data on a
cloud storage without downloading it locally.</p>
<hr>
<p>If you have any questions, concerns or ideas, let us know
<a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a> and our stellar team will get back to you in no
time!</p>https://dvc.org/blog/dvc-project-ideas-for-google-summer-of-docs-2019https://dvc.org/blog/dvc-project-ideas-for-google-summer-of-docs-2019Tue, 23 Apr 2019 00:00:00 GMT<p>We strongly believe that well-shaped documentation is key for making the product
truly open. We have been investing lots of time and energy in improving our docs
lately. As a team made up of 90% engineers, we are eager to welcome writers into
our team and our community. We are happy to share our experience, introduce them
to the world of open source and machine learning best practices, guide them through
the open source contribution process, and work together on improving our documentation.</p>
<p>DVC was started in late 2017 by a data scientist and an engineer. It is now
growing pretty fast and though our in-house team is quite small, we have to
thank our contributors (more than 80 in both code and docs) for developing DVC
with us. When working with DVC, the technical writer will not only get lots of
hands-on experience in writing technical docs, but will also become immersed in the DVC
community, a warm and welcoming gathering of ML and DS enthusiasts and an
invaluable source of inspiration and expertise in ML engineering.</p>
<h3 id="about-dvc" style="position:relative;">About DVC<a href="#about-dvc" aria-label="about dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC is the brainchild of a data scientist and an engineer. It was created to
fill gaps in ML process tooling and has evolved into a successful open
source project.</p>
<p>ML brings changes in development and research processes. These ML processes
require new tools for data versioning, ML pipeline versioning, resource
management for model training, and other needs that haven’t been formalized yet.
Traditional software development tools do not fully cover an ML team’s needs, but
there are no good alternatives. This forces engineers to build custom
toolsets to manage data files, keep track of ML experiments, and connect data and
source code together. The ML process becomes very fragile and requires tons of
tribal knowledge.</p>
<p>We have been working on <a href="http://DVC.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a> by adopting best ML practices and
turning them into a Git-like command line tool. DVC versions multi-gigabyte
datasets and ML models, making them shareable and reproducible. The tool helps to
organize a more rigorous process around datasets and their derivatives. Your
favorite cloud storage (S3, GCS, or a bare metal SSH server) can be used with
DVC as a data file backend.</p>
<p>If you are interested in learning a little bit more about DVC and its journey,
here is a great interview with the DVC creator in Episode 206 of
Podcast.__init__. Listen to it
<a href="https://www.pythonpodcast.com/data-version-control-episode-206/" target="_blank" rel="nofollow noopener noreferrer">HERE</a> or read
the transcript
<a href="https://towardsdatascience.com/data-version-control-with-dvc-what-do-the-authors-have-to-say-3c3b10f27ee" target="_blank" rel="nofollow noopener noreferrer">HERE</a>.</p>
<h3 id="the-state-of-dvc-documentation" style="position:relative;">The state of DVC documentation<a href="#the-state-of-dvc-documentation" aria-label="the state of dvc documentation permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC is a pretty young project, developed and maintained solely by engineers. Like
many open source projects, we started from the bottom, and for a long time our
<a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">documentation</a> was a bunch of bits and pieces. Nowadays
improving documentation is one of our top priorities. We moved to a new
documentation engine built in-house and started working with several technical
writers. Certain parts have been tremendously improved recently, e.g.
<a href="https://dvc.org/doc/get-started" target="_blank" rel="nofollow noopener noreferrer">Get Started</a> and
<a href="https://dvc.org/doc/commands-reference/fetch" target="_blank" rel="nofollow noopener noreferrer">certain parts of Commands Reference</a>.
So far most of our documentation has been written by the engineering
team, and there is a need to improve the overall structure and make some parts
friendlier from a new user’s perspective. We have mostly complete
<a href="https://dvc.org/doc/commands-reference" target="_blank" rel="nofollow noopener noreferrer">reference documentation</a> for each
command, although some functions are missing good actionable examples. We also
have a <a href="https://dvc.org/doc/user-guide" target="_blank" rel="nofollow noopener noreferrer">User Guide</a>, but it is not in very
good shape. We strive to make our documentation clear and comprehensive for
users of various backgrounds and proficiency levels, and this is where we need
a fresh perspective.</p>
<h3 id="how-dvc-documentation-is-built" style="position:relative;">How DVC documentation is built<a href="#how-dvc-documentation-is-built" aria-label="how dvc documentation is built permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We have an open, Apache-2 licensed GitHub repository for the
<a href="https://github.com/iterative/dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC website</a>, the documentation engine
and the <a href="https://github.com/iterative/dvc.org" target="_blank" rel="nofollow noopener noreferrer">documentation files</a>. The website
is built with Node.js + React, including the documentation engine (built
in-house).</p>
<p>Each documentation page is a static Markdown file in the repository; see
<a href="https://github.com/iterative/dvc.org/blob/main/content/docs/command-reference/index.md" target="_blank" rel="nofollow noopener noreferrer">this example</a>.
It is rendered dynamically in the browser; no preprocessing is required. This
means that tech writers or contributors need to write/edit a Markdown file,
create a pull request and merge it into the master branch of the
<a href="https://github.com/iterative/dvc.org" target="_blank" rel="nofollow noopener noreferrer">repository</a>. The complete
<a href="https://github.com/iterative/dvc.org/blob/main/README.md#contributing" target="_blank" rel="nofollow noopener noreferrer">documentation contributing guide</a>
describes the directory structure and locations for the different documentation
parts.</p>
<h3 id="dvcs-approach-to-documentation-work" style="position:relative;">DVC’s approach to documentation work<a href="#dvcs-approach-to-documentation-work" aria-label="dvcs approach to documentation work permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Documentation tasks and issues are maintained on our doc’s GitHub
<a href="https://github.com/iterative/dvc.org/issues" target="_blank" rel="nofollow noopener noreferrer">issue tracker</a>. Changes to the
documentation are made via pull requests on GitHub, and go through our standard
review process which is the same for documentation and code. A technical writer
would be trained in working with our current development process. It generally
means that tech writers or contributors need to write/edit a Markdown file, use
Git and GitHub to create a pull request and publish it. The documentation
<a href="https://github.com/iterative/dvc.org/blob/main/README.md#contributing" target="_blank" rel="nofollow noopener noreferrer">contributing guide</a>
includes style conventions and other details. Documentation is considered as
important as code. The engineering team has a policy to write or update the
relevant sections whenever something new is released. If it’s something too involved,
engineers may create a ticket and ask for help. There is one maintainer who is
responsible for doing final reviews and merging the changes. In this sense, our
documentation process is very similar to that of any other open source project.</p>
<h2 id="project-ideas-for-gsod19" style="position:relative;">Project ideas for GSoD’19<a href="#project-ideas-for-gsod19" aria-label="project ideas for gsod19 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We identified a number of ideas to work on, and they fall into two major
topics. Both topics are pretty broad and we don’t expect to cover them
completely during this GSoD, but hopefully we can make meaningful
progress.</p>
<p>First of all, we want to bring more structure and logic to our documentation to
improve user onboarding experience. The goal is for a new user to have a clear
path they can follow and understand what takeaways each part of the
documentation provides. In particular, improving how
<a href="https://dvc.org/doc/get-started" target="_blank" rel="nofollow noopener noreferrer">Get Started</a>,
<a href="https://dvc.org/doc/tutorial" target="_blank" rel="nofollow noopener noreferrer">Tutorials</a> and
<a href="https://dvc.org/doc/tutorials/versioning" target="_blank" rel="nofollow noopener noreferrer">Examples</a> relate to each other,
restructuring the existing <a href="https://dvc.org/doc/user-guide" target="_blank" rel="nofollow noopener noreferrer">User Guide</a> to
explain basic concepts, and writing more use cases that resonate with ML
engineers and data scientists.</p>
<p>The other issue we would like to tackle is improving and expanding the existing
reference docs: command descriptions, examples, etc. It involves filling in
the gaps and developing new sections, similar to
<a href="https://dvc.org/doc/commands-reference/fetch" target="_blank" rel="nofollow noopener noreferrer">this one</a>. We would also love to
see more illustrative materials.</p>
<h3 id="project-1-improving-and-expanding-user-guide" style="position:relative;">Project 1: Improving and expanding User Guide<a href="#project-1-improving-and-expanding-user-guide" aria-label="project 1 improving and expanding user guide permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><strong>Description and details:</strong> Reviewing, restructuring and filling major gaps in
the User Guide (introductory parts of the basic concepts of DVC), e.g. have a
look at <a href="https://github.com/iterative/dvc.org/issues/144" target="_blank" rel="nofollow noopener noreferrer">this ticket</a> or
<a href="https://github.com/iterative/dvc.org/issues/53" target="_blank" rel="nofollow noopener noreferrer">this one</a>.</p>
<p><strong>Mentors</strong>: <a href="https://github.com/shcheklein" target="_blank" rel="nofollow noopener noreferrer">@shcheklein</a> and
<a href="https://github.com/dmpetrov" target="_blank" rel="nofollow noopener noreferrer">@dmpetrov</a></p>
<h3 id="project-2-expanding-and-developing-new-tutorials-and-use-cases" style="position:relative;">Project 2: Expanding and developing new tutorials and use cases.<a href="#project-2-expanding-and-developing-new-tutorials-and-use-cases" aria-label="project 2 expanding and developing new tutorials and use cases permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><strong>Description and details:</strong> We already have some requests for more tutorials,
e.g. <a href="https://github.com/iterative/dvc.org/issues/96" target="_blank" rel="nofollow noopener noreferrer">this ticket</a>. Here is
another good <a href="https://github.com/iterative/dvc.org/issues/194" target="_blank" rel="nofollow noopener noreferrer">use case request</a>.
If you are going to work on this project, you will need some domain knowledge,
preferably some basic ML or data science experience.</p>
<p><strong>Mentors</strong>: <a href="https://github.com/shcheklein" target="_blank" rel="nofollow noopener noreferrer">@shcheklein</a> and
<a href="https://github.com/dmpetrov" target="_blank" rel="nofollow noopener noreferrer">@dmpetrov</a></p>
<h3 id="project-3-improving-new-user-onboarding" style="position:relative;">Project 3: Improving new user onboarding<a href="#project-3-improving-new-user-onboarding" aria-label="project 3 improving new user onboarding permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><strong>Description and details:</strong> Analyze and restructure the user walkthrough across
<a href="https://dvc.org/doc/get-started" target="_blank" rel="nofollow noopener noreferrer">Get started</a>,
<a href="https://dvc.org/doc/tutorial" target="_blank" rel="nofollow noopener noreferrer">Tutorials</a> and
<a href="https://dvc.org/doc/tutorials/versioning" target="_blank" rel="nofollow noopener noreferrer">Examples</a>. These three have one thing
in common — hands-on experience with DVC. If you choose this project, we will
work together to come up with a better location for the Examples (to move them
out of the Get Started shadow), and a better location for the Tutorials (to
reference external tutorials that were developed by our community members and
published on different platforms).</p>
<p><strong>Mentors</strong>: <a href="https://github.com/shcheklein" target="_blank" rel="nofollow noopener noreferrer">@shcheklein</a> and
<a href="https://github.com/dmpetrov" target="_blank" rel="nofollow noopener noreferrer">@dmpetrov</a></p>
<h3 id="project-4-improving-commands-reference" style="position:relative;">Project 4: Improving commands reference<a href="#project-4-improving-commands-reference" aria-label="project 4 improving commands reference permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><strong>Description and details:</strong> We will work on improving our
<a href="https://dvc.org/doc/commands-reference" target="_blank" rel="nofollow noopener noreferrer">Commands reference</a> section. This
includes expanding and filling in the gaps. One of the biggest pain points right
now is the Examples. Users want them to be
<a href="https://github.com/iterative/dvc.org/issues/198" target="_blank" rel="nofollow noopener noreferrer">easy to run and try</a>, and there
is a lot to be done in terms of improvement. We have a good example of how it
should be done <a href="https://dvc.org/doc/commands-reference/fetch" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<p><strong>Mentors</strong>: <a href="https://github.com/shcheklein" target="_blank" rel="nofollow noopener noreferrer">@shcheklein</a> and
<a href="https://github.com/dmpetrov" target="_blank" rel="nofollow noopener noreferrer">@dmpetrov</a></p>
<h3 id="project-5-describe-and-integrate-dvc-packages" style="position:relative;">Project 5: Describe and integrate “DVC packages”<a href="#project-5-describe-and-integrate-dvc-packages" aria-label="project 5 describe and integrate dvc packages permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><strong>Description and details:</strong> Describe the brand new feature “DVC packages” and
integrate it with the rest of the documentation. We have been working hard to
release a few new commands to help with datasets management (have a look at
<a href="https://github.com/iterative/dvc/issues/1487" target="_blank" rel="nofollow noopener noreferrer">this ticket</a>). It’s a major
feature that deserves its place in the Get Started, Use cases, Commands
Reference, etc.</p>
<p><strong>Mentors</strong>: <a href="https://github.com/shcheklein" target="_blank" rel="nofollow noopener noreferrer">@shcheklein</a> and
<a href="https://github.com/dmpetrov" target="_blank" rel="nofollow noopener noreferrer">@dmpetrov</a></p>
<p>The ideas we outline above are just examples of what we can work on. We are
open to any other suggestions and would like to work together with the
technical writer to make the contribution experience both useful and enjoyable
for all parties involved. If you have any suggestions or questions, we would love
to hear from you at DVC.org/support, and our DMs on
<a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> are always open!</p>
<hr>
<p>Special thanks to <a href="https://numfocus.org/" target="_blank" rel="nofollow noopener noreferrer">NumFOCUS</a> for the ideas list
inspiration.</p>
<p>If you are a tech writer, check the
<a href="https://developers.google.com/season-of-docs/docs/tech-writer-guide" target="_blank" rel="nofollow noopener noreferrer">Technical writer guide</a>.
From April 30, 2019 you can see the list of participating open source
organizations on the <a href="https://g.co/seasonofdocs" target="_blank" rel="nofollow noopener noreferrer">Season of Docs website</a>. The
application period for technical writers opens on <strong>May 29, 2019</strong> and ends on
June 28, 2019.</p>https://dvc.org/blog/april-19-dvc-heartbeathttps://dvc.org/blog/april-19-dvc-heartbeatThu, 18 Apr 2019 00:00:00 GMT<h2 id="news-and-links" style="position:relative;">News and links<a href="#news-and-links" aria-label="news and links permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We have some exciting news to share this month!</p>
<p>DVC is going to <a href="https://us.pycon.org/2019/" target="_blank" rel="nofollow noopener noreferrer">PyCon 2019</a>! It is the first
conference that we attend as a team. When we say ‘team’ — we mean it. Our
engineers are flying from all over the globe to get together offline and catch
up with fellow Pythonistas.</p>
<p>The <a href="https://us.pycon.org/2019/schedule/talks/list/" target="_blank" rel="nofollow noopener noreferrer">speaker pipeline</a> is
amazing! DVC creator Dmitry Petrov is giving a talk on
<a href="https://us.pycon.org/2019/schedule/presentation/176/" target="_blank" rel="nofollow noopener noreferrer">Machine learning model and dataset versioning practices</a>.</p>
<p>Stop by our booth at the Startup Row on Saturday, May 4, reach out and let us
know that you are willing to chat, or simply find a person with a huge DVC owl
on their shirt!</p>
<p>Speaking of the owls — DVC has done some rebranding recently and we love our new
logo. Special thanks to <a href="https://99designs.com/" target="_blank" rel="nofollow noopener noreferrer">99designs.com</a> for building a
great platform for finding trusted designers.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/91d26fd1613290e118c7a4ad1fc5a088/39600/trusted-designers.png" alt="trusted designers" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>DVC is moving fast (almost as fast as my two-year-old). We do our best to keep
up and totally love all the buzz in our community channels lately!</p>
<p>Here are a number of interesting reads that caught our eye:</p>
<ul>
<li><strong><a href="https://blog.codecentric.de/en/2019/03/walkthrough-dvc/" target="_blank" rel="nofollow noopener noreferrer">A walkthrough of DVC</a>
by <a href="https://www.linkedin.com/in/bert-besser-284564182/" target="_blank" rel="nofollow noopener noreferrer">Bert Besser</a></strong></li>
</ul>
<section class="elp-content-holder">
<a href="https://blog.codecentric.de/en/2019/03/walkthrough-dvc/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">A walkthrough of DVC — codecentric AG Blog</h4>
<div class="elp-description">This post is on how to systematically organize Machine Learning (ML) model development. A model’s performance improves…</div>
<div class="elp-link">blog.codecentric.de</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-04-18/walkthrough-of-dvc-1c1b72dfeddae88a4249d5fefe8d3cc6.png" alt="A walkthrough of DVC — codecentric AG Blog">
</div>
</a>
</section>
<p>A great article about using DVC in a fairly advanced scenario with Docker. If
you haven’t had a chance to try <a href="http://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC.org</a> yet, this is a great
comprehensive read on why you should do so right away.</p>
<ul>
<li><strong><a href="https://github.com/EthicalML/state-of-mlops-2019" target="_blank" rel="nofollow noopener noreferrer">The state of machine learning operations</a>
by <a href="https://www.linkedin.com/in/axsaucedo/" target="_blank" rel="nofollow noopener noreferrer">Alejandro Saucedo</a></strong></li>
</ul>
<section class="elp-content-holder">
<a href="https://github.com/EthicalML/state-of-mlops-2019" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">The state of machine learning operations</h4>
<div class="elp-description">Contribute to EthicalML/state-of-mlops-2019 development by creating an account on GitHub.</div>
<div class="elp-link">github.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-04-18/the-state-of-machine-learning-operations-c6493fc09702d356e3cc7ced2711e3e3.jpeg" alt="The state of machine learning operations">
</div>
</a>
</section>
<p>A short (only 8 minutes!) and inspiring talk by Alejandro Saucedo at FOSDEM.
Alejandro covers the key trends in machine learning operations, as well as most
recent open source tools and frameworks. Focused on reproducibility, monitoring
and explainability, this lightning talk is a great snapshot of the current state
of ML operations.</p>
<ul>
<li><strong><a href="https://hackernoon.com/interview-with-kaggle-grandmaster-senior-cv-engineer-at-lyft-dr-vladimir-i-iglovikov-9938e1fc7c" target="_blank" rel="nofollow noopener noreferrer">Interview with Kaggle Grandmaster, Senior Computer Vision Engineer at Lyft: Dr. Vladimir I. Iglovikov</a>
by <a href="https://twitter.com/bhutanisanyam1" target="_blank" rel="nofollow noopener noreferrer">Sanyam Bhutani</a></strong></li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://hackernoon.com/interview-with-kaggle-grandmaster-senior-cv-engineer-at-lyft-dr-vladimir-i-iglovikov-9938e1fc7c" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Interview with Kaggle Grandmaster, Senior Computer Vision Engineer at Lyft: Dr. Vladimir I. Iglovikov</h4>
<div class="elp-description">Part 24 of The series where I interview my heroes.</div>
<div class="elp-link">hackernoon.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-04-18/interview-with-kaggle-grandmaster-d1bc437a22ebae88bba9e06d5f166c06.jpeg" alt="Interview with Kaggle Grandmaster, Senior Computer Vision Engineer at Lyft: Dr. Vladimir I. Iglovikov">
</div>
</a>
</section>
<p></p>
<blockquote>
<p>There is no way you will become Kaggle Master and not learn how to approach
anew, the unknown problem in a fast hacking way with a very high number of
iterations per unit of time. This skill in the world of competitive learning
is the question of survival</p>
</blockquote>
<hr>
<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>There are lots of hidden gems in our Discord community discussions. Sometimes
they are scattered all over the channels and hard to track down.</p>
<p>We sift through the issues and discussions and share the most
interesting takeaways with you.</p>
<h3 id="q-what-are-the-system-requirements-to-install-dvc-type-of-operating-system-dependencies-of-another-application-as-git-memory-cpu-etc" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/552098155861114891" target="_blank" rel="nofollow noopener noreferrer">What are the system requirements to install DVC (type of operating system, dependencies of another application (as GIT), memory, cpu, etc).</a><a href="#q-what-are-the-system-requirements-to-install-dvc-type-of-operating-system-dependencies-of-another-application-as-git-memory-cpu-etc" aria-label="q what are the system requirements to install dvc type of operating system dependencies of another application as git memory cpu etc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<ul>
<li>
<p>It supports Windows, Mac, and Linux, and runs on Python 2 and 3.</p>
</li>
<li>
<p>No specific CPU or RAM requirements: it’s a lightweight command line tool and
should be able to run pretty much everywhere you can run Python.</p>
</li>
<li>
<p>It depends on a few Python libraries that it installs as dependencies (they
are specified in the
<a href="https://github.com/iterative/dvc/blob/master/setup.py" target="_blank" rel="nofollow noopener noreferrer"><code>setup.py</code></a>).</p>
</li>
<li>
<p>It does not depend on Git and theoretically could be run without any SCM.
Running it on top of a Git repository is recommended, however, because it gives
you the ability to save the history of datasets, models, etc. (even though it does
not put them into Git directly).</p>
</li>
</ul>
<h3 id="q-do-i-have-to-buy-a-server-license-to-run-dvc-do-you-have-this" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/560212552638791706" target="_blank" rel="nofollow noopener noreferrer">Do I have to buy a server license to run DVC, do you have this?</a><a href="#q-do-i-have-to-buy-a-server-license-to-run-dvc-do-you-have-this" aria-label="q do i have to buy a server license to run dvc do you have this permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>No server licenses for DVC. It is 100% free and open source.</p>
<h3 id="q-what-is-the-storage-limit-when-using-dvc" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/560154903331340289" target="_blank" rel="nofollow noopener noreferrer">What is the storage limit when using DVC?</a><a href="#q-what-is-the-storage-limit-when-using-dvc" aria-label="q what is the storage limit when using dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>I am trying to version control datasets and models with >10 GB (Potentially even
bigger). Can DVC handle this?</p>
<p>There is no limit enforced by DVC itself. It depends on the size of your
local or <a href="https://dvc.org/doc/commands-reference/remote" target="_blank" rel="nofollow noopener noreferrer">remote storage</a>. You
need enough space on S3, your SSH server, or whatever other storage you
use to keep the data files, models, and their versions that you would
like to store.</p>
<h3 id="q-how-does-dvc-know-the-sequence-of-stages-to-run" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/553731815228178433" target="_blank" rel="nofollow noopener noreferrer">How does DVC know the sequence of stages to run</a>?<a href="#q-how-does-dvc-know-the-sequence-of-stages-to-run" aria-label="q how does dvc know the sequence of stages to run permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>How does it connect them? Does it see that there is a dependency which is
outputted from the first run?</p>
<p>DVC figures out the pipeline by looking at the dependencies and outputs of the
stages. For example, having the following:</p>
<p></p><div id="gist95747345" class="gist">
<div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container">
<div id="file-heartbeat-dvc-run-2019-04-sh" class="file my-2">
<div itemprop="text" class="Box-body p-0 blob-wrapper data type-shell" style="overflow: auto" tabindex="0" role="region" aria-label="heartbeat-dvc-run-2019-04.sh content, created by SvetaGr on 05:45PM on April 16, 2019.">
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="heartbeat-dvc-run-2019-04.sh">
<tbody><tr>
<td id="file-heartbeat-dvc-run-2019-04-sh-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
<td id="file-heartbeat-dvc-run-2019-04-sh-LC1" class="blob-code blob-code-inner js-file-line">$ dvc run -f download.dvc \</td>
</tr>
<tr>
<td id="file-heartbeat-dvc-run-2019-04-sh-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
<td id="file-heartbeat-dvc-run-2019-04-sh-LC2" class="blob-code blob-code-inner js-file-line"> -o joke.txt \</td>
</tr>
<tr>
<td id="file-heartbeat-dvc-run-2019-04-sh-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
<td id="file-heartbeat-dvc-run-2019-04-sh-LC3" class="blob-code blob-code-inner js-file-line"> "curl https://geek-jokes.sameerkumar.website/api > joke.txt"</td>
</tr>
<tr>
<td id="file-heartbeat-dvc-run-2019-04-sh-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
<td id="file-heartbeat-dvc-run-2019-04-sh-LC4" class="blob-code blob-code-inner js-file-line">$ dvc run -f duplicate.dvc \</td>
</tr>
<tr>
<td id="file-heartbeat-dvc-run-2019-04-sh-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
<td id="file-heartbeat-dvc-run-2019-04-sh-LC5" class="blob-code blob-code-inner js-file-line"> -d joke.txt \</td>
</tr>
<tr>
<td id="file-heartbeat-dvc-run-2019-04-sh-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
<td id="file-heartbeat-dvc-run-2019-04-sh-LC6" class="blob-code blob-code-inner js-file-line"> -o duplicate.txt \</td>
</tr>
<tr>
<td id="file-heartbeat-dvc-run-2019-04-sh-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
<td id="file-heartbeat-dvc-run-2019-04-sh-LC7" class="blob-code blob-code-inner js-file-line"> "cat joke.txt joke.txt > duplicate.txt"</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
</div>
<div class="gist-meta">
<a href="https://gist.github.com/SvetaGr/a2a28fbc9db0a675422785bc5f925e14/raw/3802fa1b440a2b798568e0cac1be81ae10dd2acd/heartbeat-dvc-run-2019-04.sh" style="float:right" class="Link--inTextBlock">view raw</a>
<a href="https://gist.github.com/SvetaGr/a2a28fbc9db0a675422785bc5f925e14#file-heartbeat-dvc-run-2019-04-sh" class="Link--inTextBlock">
heartbeat-dvc-run-2019-04.sh
</a>
hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
</div>
</div>
</div><p></p>
<p>you will end up with two stages: <code>download.dvc</code> and <code>duplicate.dvc</code>. The
download stage has <code>joke.txt</code> as an output. The duplicate stage defines
<code>joke.txt</code> as a dependency. Because it is the same file, DVC detects the match and creates
a pipeline by joining those stages.</p>
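<p>The matching of dependencies to outputs can be sketched in a few lines of Python. This is only an illustration of the idea, not DVC's actual implementation: the <code>STAGES</code> dictionary and the helper names <code>upstream_stages</code> and <code>run_order</code> are hypothetical, and real stage files are YAML rather than Python dictionaries.</p>

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory view of the two stage files from the example;
# it only mirrors their dependencies and outputs.
STAGES = {
    "download.dvc": {"deps": [], "outs": ["joke.txt"]},
    "duplicate.dvc": {"deps": ["joke.txt"], "outs": ["duplicate.txt"]},
}


def upstream_stages(stages):
    """Map each stage to the set of stages that produce its dependencies."""
    producer = {out: name for name, stage in stages.items() for out in stage["outs"]}
    return {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }


def run_order(stages):
    """Return stage names ordered so that producers come before consumers."""
    return list(TopologicalSorter(upstream_stages(stages)).static_order())


print(run_order(STAGES))  # ['download.dvc', 'duplicate.dvc']
```

<p>Because <code>joke.txt</code> is an output of one stage and a dependency of the other, the sketch places <code>download.dvc</code> before <code>duplicate.dvc</code> without the order ever being declared explicitly.</p>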
<p>Stage files are human-readable, so you can inspect the content of each one. Their
format is described <a href="https://dvc.org/doc/user-guide/project-structure" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<h3 id="q-is-it-possible-to-use-the-same-data-of-a-remote-in-two-different-repositories" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/560022999848321026" target="_blank" rel="nofollow noopener noreferrer">Is it possible to use the same data of a remote in two different repositories?</a><a href="#q-is-it-possible-to-use-the-same-data-of-a-remote-in-two-different-repositories" aria-label="q is it possible to use the same data of a remote in two different repositories permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>(e.g. in one repo, run <a href="https://dvc.org/doc/command-reference/pull#-r"><code>dvc pull -r my_remote</code></a> to pull some data; running the
same command in a different Git repo should pull the same data)</p>
<p>Yes! It’s a frequent scenario for multiple repos to share remotes and even a local
cache. A DVC file serves as a link to the actual data. If you add the same DVC
file (e.g. <code>data.dvc</code>) to the new repo and run <a href="https://dvc.org/doc/command-reference/pull#-r"><code>dvc pull -r remotename data.dvc</code></a>,
it will fetch the data. You have to use <a href="https://dvc.org/doc/command-reference/remote/add"><code>dvc remote add</code></a> first to specify the
coordinates of the remote storage you would like to share in every project.
Alternatively (check out the question below), you could use <code>--global</code> to
specify a single default remote (and/or cache dir) per machine.</p>
<h3 id="q-could-i-set-a-global-remote-server-instead-of-config-in-each-project" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485586884165107734/559653121228275727" target="_blank" rel="nofollow noopener noreferrer">Could I set a global remote server, instead of config in each project?</a><a href="#q-could-i-set-a-global-remote-server-instead-of-config-in-each-project" aria-label="q could i set a global remote server instead of config in each project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Use <code>--global</code> when you specify the remote settings. The remote will then be visible
to all projects on the same machine. The <code>--global</code> option saves the remote configuration to
the global config (e.g. <code>~/.config/dvc/config</code>) instead of the per-project one
(<code>.dvc/config</code>). See more details
<a href="https://dvc.org/doc/commands-reference/remote/add" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<h3 id="q-how-do-i-version-a-large-dataset-in-s3-or-any-other-storage" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/554679392823934977" target="_blank" rel="nofollow noopener noreferrer">How do I version a large dataset in S3 or any other storage?</a><a href="#q-how-do-i-version-a-large-dataset-in-s3-or-any-other-storage" aria-label="q how do i version a large dataset in s3 or any other storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>We recommend skimming through our
<a href="https://dvc.org/doc/get-started" target="_blank" rel="nofollow noopener noreferrer">get started</a> tutorial. To summarize the data
versioning process of DVC:</p>
<ul>
<li>You create stage files (aka DVC-files) by adding or importing files (<a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> /
<a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a>), or by running a command that generates files:</li>
</ul>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">--out</span> file.csv <span class="token string">"wget https://example.com/file.csv"</span></span></code></pre></div>
<ul>
<li>
<p>These stage files are tracked by <code>git</code></p>
</li>
<li>
<p>You use git to retrieve previous stage files (e.g. <code>git checkout v1.0</code>)</p>
</li>
<li>
<p>Then use <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a> to retrieve all the files referenced by those stage files</p>
</li>
</ul>
<p>All your files (in every version) are stored in a <code>.dvc/cache</code>
directory, which you sync with remote storage (for example, an S3 bucket) using the
<a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> or <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> commands. This is analogous to <code>git push</code> / <code>git pull</code>, but
instead of syncing your <code>.git</code> directory, you are syncing the cache inside your <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> directory.</p>
<h3 id="q-how-do-i-moverename-a-dvc-file" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/558216007684980736" target="_blank" rel="nofollow noopener noreferrer">How do I move/rename a DVC-file?</a><a href="#q-how-do-i-moverename-a-dvc-file" aria-label="q how do i moverename a dvc file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>If you need to move your dvc file somewhere, it is pretty easy, even if done
manually:</p>
<p></p><div id="gist95752643" class="gist">
<div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container">
<div id="file-heartbeat-dvc-rename-sh" class="file my-2">
<div itemprop="text" class="Box-body p-0 blob-wrapper data type-shell" style="overflow: auto" tabindex="0" role="region" aria-label="heartbeat-dvc-rename.sh content, created by SvetaGr on 12:45AM on April 17, 2019.">
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="heartbeat-dvc-rename.sh">
<tbody><tr>
<td id="file-heartbeat-dvc-rename-sh-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
<td id="file-heartbeat-dvc-rename-sh-LC1" class="blob-code blob-code-inner js-file-line">$ mv my.dvc data/my.dvc</td>
</tr>
<tr>
<td id="file-heartbeat-dvc-rename-sh-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
<td id="file-heartbeat-dvc-rename-sh-LC2" class="blob-code blob-code-inner js-file-line"># and now open data/my.dvc with your favorite editor and change wdir in it to 'wdir: ../'.</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
</div>
<div class="gist-meta">
<a href="https://gist.github.com/SvetaGr/b25a5b45773bf94d36e60d48462502f4/raw/b9f920208a50afb55bda6c7527081babfcc323fe/heartbeat-dvc-rename.sh" style="float:right" class="Link--inTextBlock">view raw</a>
<a href="https://gist.github.com/SvetaGr/b25a5b45773bf94d36e60d48462502f4#file-heartbeat-dvc-rename-sh" class="Link--inTextBlock">
heartbeat-dvc-rename.sh
</a>
hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
</div>
</div>
</div><p></p>
<h3 id="q-i-performed-dvc-push-of-a-file-to-a-remote-on-the-remote-there-is-created-a-directory-called-8f-with-a-file-inside-called-2ec34faf91ff15ef64abf3fbffa7ee-the-original-csv-file-doesnt-appear-on-the-remote-is-that-expected-behaviour" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/555431645402890255" target="_blank" rel="nofollow noopener noreferrer">I performed <code>dvc push</code> of a file to a remote. On the remote there is created a directory called <code>8f</code> with a file inside called <code>2ec34faf91ff15ef64abf3fbffa7ee</code>. The original CSV file doesn’t appear on the remote. Is that expected behaviour?</a><a href="#q-i-performed-dvc-push-of-a-file-to-a-remote-on-the-remote-there-is-created-a-directory-called-8f-with-a-file-inside-called-2ec34faf91ff15ef64abf3fbffa7ee-the-original-csv-file-doesnt-appear-on-the-remote-is-that-expected-behaviour" aria-label="q i performed dvc push of a file to a remote on the remote there is created a directory called 8f with a file inside called 2ec34faf91ff15ef64abf3fbffa7ee the original csv file doesnt appear on the remote is that expected behaviour permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>This is expected behavior. DVC saves each file under a name derived from its
checksum in order to prevent duplication: the first two characters of the checksum
become the directory name (<code>8f</code> here) and the rest becomes the file name. If you delete the “pushed” file in your
project directory and run <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a>, DVC will take care of pulling the file
and restoring it under its original name.</p>
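<p>The naming scheme can be sketched in Python. This is a minimal illustration, assuming the checksum is a plain MD5 of the file's bytes and that the first two hex characters form the directory name, as in the <code>8f</code> example from the question. The function name <code>cache_relpath</code> is hypothetical and is not part of DVC's API.</p>

```python
import hashlib


def cache_relpath(content: bytes) -> str:
    """Return a checksum-based relative path for the given file content."""
    checksum = hashlib.md5(content).hexdigest()
    # The first two hex characters name the directory, the remaining 30 the file.
    return f"{checksum[:2]}/{checksum[2:]}"


# Content of a file created with: echo "foo"
print(cache_relpath(b"foo\n"))  # d3/b07384d113edec49eaa6238ad5ff00
```

<p>Two files with identical content produce the same path, which is why storing data under its checksum avoids keeping duplicate copies.</p>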
<p>Below are some details about how DVC cache works, just to illustrate the logic.
When you add a data source:</p>
<p></p><div id="gist95752678" class="gist">
<div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container">
<div id="file-heartbeat-remote-file-naming-sh" class="file my-2">
<div itemprop="text" class="Box-body p-0 blob-wrapper data type-shell" style="overflow: auto" tabindex="0" role="region" aria-label="heartbeat-remote-file-naming.sh content, created by SvetaGr on 12:49AM on April 17, 2019.">
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="heartbeat-remote-file-naming.sh">
<tbody><tr>
<td id="file-heartbeat-remote-file-naming-sh-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
<td id="file-heartbeat-remote-file-naming-sh-LC1" class="blob-code blob-code-inner js-file-line">$ echo "foo" > data.txt</td>
</tr>
<tr>
<td id="file-heartbeat-remote-file-naming-sh-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
<td id="file-heartbeat-remote-file-naming-sh-LC2" class="blob-code blob-code-inner js-file-line">$ dvc add data.txt</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
</div>
<div class="gist-meta">
<a href="https://gist.github.com/SvetaGr/b69fa8ce36bcce00ecd69e7f2d7ccd2e/raw/34017336326e3773f2e3a490e1f66265025f8c81/heartbeat-remote-file-naming.sh" style="float:right" class="Link--inTextBlock">view raw</a>
<a href="https://gist.github.com/SvetaGr/b69fa8ce36bcce00ecd69e7f2d7ccd2e#file-heartbeat-remote-file-naming-sh" class="Link--inTextBlock">
heartbeat-remote-file-naming.sh
</a>
hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
</div>
</div>
</div><p></p>
<p>DVC computes the (MD5) checksum of the file and generates a DVC file with the related
information:</p>
<p></p><div id="gist95752688" class="gist">
<div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container">
<div id="file-heartbeat-dvc-file-2019-04-yaml" class="file my-2">
<div itemprop="text" class="Box-body p-0 blob-wrapper data type-yaml" style="overflow: auto" tabindex="0" role="region" aria-label="heartbeat-dvc-file-2019-04.yaml content, created by SvetaGr on 12:50AM on April 17, 2019.">
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="heartbeat-dvc-file-2019-04.yaml">
<tbody><tr>
<td id="file-heartbeat-dvc-file-2019-04-yaml-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
<td id="file-heartbeat-dvc-file-2019-04-yaml-LC1" class="blob-code blob-code-inner js-file-line">md5: 3bccbf004063977442029334c3448687</td>
</tr>
<tr>
<td id="file-heartbeat-dvc-file-2019-04-yaml-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
<td id="file-heartbeat-dvc-file-2019-04-yaml-LC2" class="blob-code blob-code-inner js-file-line">outs:</td>
</tr>
<tr>
<td id="file-heartbeat-dvc-file-2019-04-yaml-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
<td id="file-heartbeat-dvc-file-2019-04-yaml-LC3" class="blob-code blob-code-inner js-file-line">- cache: true</td>
</tr>
<tr>
<td id="file-heartbeat-dvc-file-2019-04-yaml-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
<td id="file-heartbeat-dvc-file-2019-04-yaml-LC4" class="blob-code blob-code-inner js-file-line">  md5: d3b07384d113edec49eaa6238ad5ff00</td>
</tr>
<tr>
<td id="file-heartbeat-dvc-file-2019-04-yaml-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
<td id="file-heartbeat-dvc-file-2019-04-yaml-LC5" class="blob-code blob-code-inner js-file-line">  metric: false</td>
</tr>
<tr>
<td id="file-heartbeat-dvc-file-2019-04-yaml-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
<td id="file-heartbeat-dvc-file-2019-04-yaml-LC6" class="blob-code blob-code-inner js-file-line">  path: data.txt</td>
</tr>
<tr>
<td id="file-heartbeat-dvc-file-2019-04-yaml-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
<td id="file-heartbeat-dvc-file-2019-04-yaml-LC7" class="blob-code blob-code-inner js-file-line">wdir: ..</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
</div>
<div class="gist-meta">
<a href="https://gist.github.com/SvetaGr/110ae76df929654ec573ea9e4b1e1980/raw/3ccd7b7ab89e1e4246c1d8c83d6051df2379bd6d/heartbeat-dvc-file-2019-04.yaml" style="float:right" class="Link--inTextBlock">view raw</a>
<a href="https://gist.github.com/SvetaGr/110ae76df929654ec573ea9e4b1e1980#file-heartbeat-dvc-file-2019-04-yaml" class="Link--inTextBlock">
heartbeat-dvc-file-2019-04.yaml
</a>
hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
</div>
</div>
</div><p></p>
<p>The original file is moved to the cache, and a link or copy (depending on your
filesystem) is created to replace it in your workspace:</p>
<p></p><div id="gist95752708" class="gist">
<div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container">
<div id="file-heartbeat-cache-structure-2019-04-sh" class="file my-2">
<div itemprop="text" class="Box-body p-0 blob-wrapper data type-shell" style="overflow: auto" tabindex="0" role="region" aria-label="heartbeat-cache-structure-2019-04.sh content, created by SvetaGr on 12:53AM on April 17, 2019.">
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="heartbeat-cache-structure-2019-04.sh">
<tbody><tr>
<td id="file-heartbeat-cache-structure-2019-04-sh-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
<td id="file-heartbeat-cache-structure-2019-04-sh-LC1" class="blob-code blob-code-inner js-file-line">.dvc/cache</td>
</tr>
<tr>
<td id="file-heartbeat-cache-structure-2019-04-sh-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
<td id="file-heartbeat-cache-structure-2019-04-sh-LC2" class="blob-code blob-code-inner js-file-line">└── d3</td>
</tr>
<tr>
<td id="file-heartbeat-cache-structure-2019-04-sh-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
<td id="file-heartbeat-cache-structure-2019-04-sh-LC3" class="blob-code blob-code-inner js-file-line">    └── b07384d113edec49eaa6238ad5ff00</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
</div>
<div class="gist-meta">
<a href="https://gist.github.com/SvetaGr/133cb93e5a21c6f21a86f8709ed39ea9/raw/540aa50da9bb891da01030a8877688b74eecc20e/heartbeat-cache-structure-2019-04.sh" style="float:right" class="Link--inTextBlock">view raw</a>
<a href="https://gist.github.com/SvetaGr/133cb93e5a21c6f21a86f8709ed39ea9#file-heartbeat-cache-structure-2019-04-sh" class="Link--inTextBlock">
heartbeat-cache-structure-2019-04.sh
</a>
hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
</div>
</div>
</div><p></p>
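<p>To make the cache layout above concrete, here is a minimal Python sketch of how a checksum could map to a cache path: the first two hex characters name the directory and the rest name the file. The helper names <code>md5_of_bytes</code> and <code>cache_path</code> are ours, for illustration only; this is not DVC's actual implementation.</p>

```python
import hashlib
from pathlib import PurePosixPath


def md5_of_bytes(content: bytes) -> str:
    """Return the hex MD5 digest of the given bytes."""
    return hashlib.md5(content).hexdigest()


def cache_path(checksum: str, cache_dir: str = ".dvc/cache") -> str:
    """Map a checksum to the two-level layout shown above.

    Hypothetical helper: the first two hex characters become the
    directory name, the remaining characters become the file name.
    """
    return str(PurePosixPath(cache_dir) / checksum[:2] / checksum[2:])


# Checksum copied from the DVC-file shown earlier in the post.
print(cache_path("d3b07384d113edec49eaa6238ad5ff00"))
```

<p>In a real project you would hash the file contents (for example with <code>md5_of_bytes</code>) instead of hard-coding the checksum.</p>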
<h3 id="q-is-it-possible-to-integrate-dvc-with-our-in-house-tools-developed-in-python" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485586884165107734/553570391000481802" target="_blank" rel="nofollow noopener noreferrer">Is it possible to integrate dvc with our in-house tools developed in Python?</a><a href="#q-is-it-possible-to-integrate-dvc-with-our-in-house-tools-developed-in-python" aria-label="q is it possible to integrate dvc with our in house tools developed in python permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Absolutely! There are three ways you could interact with DVC:</p>
<ol>
<li>
<p>Use <a href="https://docs.python.org/3/library/subprocess.html" target="_blank" rel="nofollow noopener noreferrer">subprocess</a> to launch
DVC</p>
</li>
<li>
<p>Use <code>from dvc.main import main</code> and call it with the same arguments as the CLI, passed as a list, like
<code>ret = main(['add', 'foo'])</code></p>
</li>
<li>
<p>Use our internal API (see <code>dvc/repo</code> and <code>dvc/command</code> in our source to get a
grasp of it). It is not officially public yet, and we don’t have any special
docs for it, but it is fairly stable and could definitely be used for a POC.
We’ll add docs and all the official stuff for it in the not-so-distant
future.</p>
</li>
</ol>
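<p>As a rough illustration of the first option, the sketch below wraps <code>subprocess.run</code> in a small helper. The name <code>run_dvc</code> and its <code>executable</code> parameter are our own assumptions, added so the helper can be tried out even on a machine without DVC installed.</p>

```python
import shutil
import subprocess
from typing import List


def run_dvc(args: List[str], executable: str = "dvc") -> subprocess.CompletedProcess:
    """Run a DVC command as a child process and capture its output.

    Hypothetical helper: `executable` is configurable only to make
    the function easy to test; by default it launches `dvc`.
    """
    return subprocess.run(
        [executable, *args],
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode output as str
        check=False,          # let the caller inspect the return code
    )


# Only call DVC if it is actually available on PATH.
if shutil.which("dvc") is not None:
    result = run_dvc(["version"])
    print(result.returncode)
```

<p>Your in-house tool can then inspect <code>returncode</code>, <code>stdout</code>, and <code>stderr</code> on the returned object and decide how to react.</p>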
<h3 id="q-can-i-still-track-the-linkage-between-data-and-model-without-using-dvc-run-and-a-graph-of-tasks-basically-what-would-like-extremely-minimal-dvc-invasion-into-my-git-repo-for-an-existing-machine-learning-application" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485586884165107734/555750217522216990" target="_blank" rel="nofollow noopener noreferrer">Can I still track the linkage between data and model without using <code>dvc run</code></a> and a graph of tasks? Basically, I would like an extremely minimal DVC invasion into my Git repo for an existing machine learning application.<a href="#q-can-i-still-track-the-linkage-between-data-and-model-without-using-dvc-run-and-a-graph-of-tasks-basically-what-would-like-extremely-minimal-dvc-invasion-into-my-git-repo-for-an-existing-machine-learning-application" aria-label="q can i still track the linkage between data and model without using dvc run and a graph of tasks basically what would like extremely minimal dvc invasion into my git repo for an existing machine learning application permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There are two options:</p>
<ol>
<li>
<p>Use <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> to track models and/or input datasets. It should be enough if
you use <code>git commit</code> on DVC files produced by <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>. This is the very
minimum you can get with DVC, and it does not require using <code>dvc run</code>. Check the
first part (up to the Pipelines/Add transformations section) of the DVC
<a href="https://dvc.org/doc/get-started" target="_blank" rel="nofollow noopener noreferrer">get started</a>.</p>
</li>
<li>
<p>You could use <code>--no-exec</code> in <code>dvc run</code> and then just <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> and
<code>git commit</code> the results. That way you’ll get your DVC files with all the
linkages, without having to actually run your commands through DVC.</p>
</li>
</ol>
<p>If you have any questions, concerns or ideas, let us know
<a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a> and our stellar team will get back to you in no
time.</p>https://dvc.org/blog/march-19-dvc-heartbeathttps://dvc.org/blog/march-19-dvc-heartbeatTue, 05 Mar 2019 00:00:00 GMT<p>This is the very first issue of the DVC❤️Heartbeat. Every month we will be
sharing our news, findings, interesting reads, community takeaways, and
everything along the way.</p>
<p>Some of those are related to our brainchild <a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a> and its
journey. The others are a collection of exciting stories and ideas centered
around ML best practices and workflow.</p>
<h2 id="news-and-links" style="position:relative;">News and links<a href="#news-and-links" aria-label="news and links permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We read a ton of articles and posts every day, and here are a few that caught our
eye. They are well written, offer a different perspective, and are definitely worth
checking out.</p>
<ul>
<li><strong><a href="https://veekaybee.github.io/2019/02/13/data-science-is-different/" target="_blank" rel="nofollow noopener noreferrer">Data science is different now</a>
by <a href="https://veekaybee.github.io/" target="_blank" rel="nofollow noopener noreferrer">Vicki Boykis</a></strong></li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://veekaybee.github.io/2019/02/13/data-science-is-different/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Data science is different now</h4>
<div class="elp-description">Woman holding a balance, Vermeer 1664 What do you think of when you read the phrase 'data science'? It's probably some…</div>
<div class="elp-link">veekaybee.github.io</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-03-05/data-science-is-different-now-ef77fccb7554382d75f7471a2564633f.png" alt="Data science is different now">
</div>
</a>
</section>
<p></p>
<blockquote>
<p>What is becoming clear is that, in the late stage of the hype cycle, data
science is asymptotically moving closer to engineering, and the
<a href="https://www.youtube.com/watch?v=frQeK8xo9Ls" target="_blank" rel="nofollow noopener noreferrer">skills that data scientists need</a>
moving forward are less visualization and statistics-based, and
<a href="https://tech.trivago.com/2018/12/03/teardown-rebuild-migrating-from-hive-to-pyspark/" target="_blank" rel="nofollow noopener noreferrer">more in line with traditional computer science curricula</a>.</p>
</blockquote>
<ul>
<li><strong><a href="https://emilygorcenski.com/post/data-versioning/" target="_blank" rel="nofollow noopener noreferrer">Data Versioning</a> by
<a href="https://emilygorcenski.com/" target="_blank" rel="nofollow noopener noreferrer">Emily F. Gorcenski</a></strong></li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://emilygorcenski.com/post/data-versioning/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Data Versioning</h4>
<div class="elp-description">Productionizing machine learning/AI/data science is a challenge. Not only are the outputs of machine-learning…</div>
<div class="elp-link">emilygorcenski.com</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-03-05/data-versioning-44da0cbe3c804f68cee118e39b9ac318.jpeg" alt="Data Versioning">
</div>
</a>
</section>
<p></p>
<blockquote>
<p>I want to explore how the degrees of freedom in versioning machine learning
systems poses a unique challenge. I’ll identify four key axes on which machine
learning systems have a notion of version, along with some brief
recommendations for how to simplify this a bit.</p>
</blockquote>
<ul>
<li><strong><a href="https://blog.mi.hdm-stuttgart.de/index.php/2019/02/26/reproducibility-in-ml/" target="_blank" rel="nofollow noopener noreferrer">Reproducibility in Machine Learning</a>
by <a href="https://blog.mi.hdm-stuttgart.de/index.php/author/pf023/" target="_blank" rel="nofollow noopener noreferrer">Pascal Fecht</a></strong></li>
</ul>
<p>
</p><section class="elp-content-holder">
<a href="https://blog.mi.hdm-stuttgart.de/index.php/2019/02/26/reproducibility-in-ml/" class="external-link-preview" target="_blank" rel="noopener noreferrer">
<div class="elp-description-holder">
<h4 class="elp-title">Reproducibility in Machine Learning | Computer Science Blog</h4>
<div class="elp-description">The rise of Machine Learning has led to changes across all areas of computer science. From a very abstract point of…</div>
<div class="elp-link">blog.mi.hdm-stuttgart.de</div>
</div>
<div class="elp-image-holder">
<img src="https://dvc.org/2019-03-05/reproducibility-in-machine-learning-4fa14e52fb2fa408a0b6870280e31566.jpeg" alt="Reproducibility in Machine Learning | Computer Science Blog">
</div>
</a>
</section>
<p></p>
<blockquote>
<p>…the objective of this post is not to philosophize about the dangers and
dark sides of AI. In fact, this post aims to work out common challenges in
reproducibility for machine learning and shows programming differences to
other areas of Computer Science. Secondly, we will see practices and workflows
to create a higher grade of reproducibility in machine learning algorithms.</p>
</blockquote>
<hr>
<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>There are lots of hidden gems in our Discord community discussions. Sometimes
they are scattered all over the channels and hard to track down.</p>
<p>We will be sifting through the issues and discussions and sharing the most
interesting takeaways.</p>
<h3 id="q-edit-and-define-dvc-files-manually-in-a-makefile-style" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485586884165107734/541622187296161816" target="_blank" rel="nofollow noopener noreferrer">Edit and define DVC files manually, in a Makefile style</a><a href="#q-edit-and-define-dvc-files-manually-in-a-makefile-style" aria-label="q edit and define dvc files manually in a makefile style permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There is no separate guide for that, but it is very straightforward. See the
<a href="https://dvc.org/doc/user-guide/project-structure" target="_blank" rel="nofollow noopener noreferrer">DVC file format</a> description
for what a DVC-file looks like inside. All that <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> or <code>dvc run</code> does is
compute the <code>md5</code> fields in it. You could write your DVC-file by hand
and then run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>, which will run the command (if any) and compute all the needed
checksums; <a href="https://discordapp.com/channels/485586884165107732/485586884165107734/541622187296161816" target="_blank" rel="nofollow noopener noreferrer">read more</a>.</p>
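<p>As a hedged illustration of writing a DVC-file by hand, the sketch below renders a minimal stage description without any <code>md5</code> fields, leaving those for <code>dvc repro</code> to compute. The field names <code>cmd</code>, <code>deps</code>, <code>outs</code>, and <code>path</code> reflect our reading of the DVC file format, and the script and file names are hypothetical; please check the format description before relying on this.</p>

```python
from typing import List


def render_stage_file(cmd: str, deps: List[str], outs: List[str]) -> str:
    """Render a minimal hand-written DVC-file as YAML text.

    Assumed layout: `cmd`, `deps`, and `outs` keys, each entry
    identified by `path`. Checksums are intentionally left out.
    """
    lines = [f"cmd: {cmd}"]
    if deps:
        lines.append("deps:")
        lines.extend(f"- path: {dep}" for dep in deps)
    if outs:
        lines.append("outs:")
        lines.extend(f"- path: {out}" for out in outs)
    return "\n".join(lines) + "\n"


# Hypothetical training stage: names are placeholders, not from the post.
print(render_stage_file("python train.py", ["data.txt", "train.py"], ["model.pkl"]))
```

<p>Saving the printed text to a file such as <code>train.dvc</code> (a placeholder name) and running <code>dvc repro</code> on it is the Makefile-style workflow described above.</p>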
<h3 id="q-best-practices-to-define-the-code-dependencies" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485586884165107734/547424240677158915" target="_blank" rel="nofollow noopener noreferrer">Best practices to define the code dependencies</a><a href="#q-best-practices-to-define-the-code-dependencies" aria-label="q best practices to define the code dependencies permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>There’s a ton of code in that project, and it’s very non-trivial to define the
code dependencies for my training stage — there are a lot of imports going on,
the training code is distributed across many modules,
<a href="https://discordapp.com/channels/485586884165107732/485586884165107734/547424240677158915" target="_blank" rel="nofollow noopener noreferrer">read more</a></p>
<h3 id="q-azure-data-lake-support" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485586884165107734/548495589428428801" target="_blank" rel="nofollow noopener noreferrer">Azure data lake support</a><a href="#q-azure-data-lake-support" aria-label="q azure data lake support permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC officially only supports regular Azure blob storage. Gen1 Data Lake should
be accessible by the same interface, so configuring a regular Azure remote for
DVC should work. It seems that Gen2 Data Lake
<a href="https://discordapp.com/channels/485586884165107732/485586884165107734/550546413197590539" target="_blank" rel="nofollow noopener noreferrer">has disabled</a>
the blob API. If you know more details about the difference between Gen1 and Gen2,
feel free to join <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">our community</a> and share this
knowledge.</p>
<h3 id="q-what-licence-dvc-is-released-under" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/542390986299539459" target="_blank" rel="nofollow noopener noreferrer">What licence DVC is released under</a><a href="#q-what-licence-dvc-is-released-under" aria-label="q what licence dvc is released under permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Apache 2.0. One of the <a href="https://opensource.org/licenses" target="_blank" rel="nofollow noopener noreferrer">most common</a> and
permissive OSS licences.</p>
<h3 id="q-setting-up-s3-compatible-remote" style="position:relative;">Q: Setting up S3 compatible remote<a href="#q-setting-up-s3-compatible-remote" aria-label="q setting up s3 compatible remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>(<a href="https://discordapp.com/channels/485586884165107732/485596304961962003/543445798868746278" target="_blank" rel="nofollow noopener noreferrer">Localstack</a>,
<a href="https://discordapp.com/channels/485586884165107732/485596304961962003/541466951474479115" target="_blank" rel="nofollow noopener noreferrer">wasabi</a>)</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> upstream s3://my-bucket
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> upstream region REGION_NAME
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> upstream endpointurl <span class="token operator"><</span>url<span class="token operator">></span></span></code></pre></div>
<p>Find and click the <code>S3 API compatible storage</code> section on
<a href="https://dvc.org/doc/commands-reference/remote/add" target="_blank" rel="nofollow noopener noreferrer">this page</a>.</p>
<h3 id="q-why-dvc-creates-and-updates-gitignore-file" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/543914550173368332" target="_blank" rel="nofollow noopener noreferrer">Why DVC creates and updates <code>.gitignore</code> file?</a><a href="#q-why-dvc-creates-and-updates-gitignore-file" aria-label="q why dvc creates and updates gitignore file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>DVC adds the data files it tracks to <code>.gitignore</code> so that you don’t
accidentally add them to Git as well. You can open the file with an editor of your
liking and see your data files listed there.</p>
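<p>The idea can be sketched in a few lines of Python: append the tracked path to <code>.gitignore</code> unless it is already listed. This is only an illustration of the behavior, not DVC's own code; the helper name <code>ignore_path</code> and the <code>/data.txt</code> entry format are assumptions.</p>

```python
import tempfile
from pathlib import Path


def ignore_path(gitignore: Path, entry: str) -> bool:
    """Add `entry` to the .gitignore file unless it is already listed.

    Returns True if the entry was added, False if it was present.
    """
    text = gitignore.read_text() if gitignore.exists() else ""
    if entry in text.splitlines():
        return False
    if text and not text.endswith("\n"):
        text += "\n"  # keep the new entry on its own line
    gitignore.write_text(text + entry + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    gitignore = Path(tmp) / ".gitignore"
    ignore_path(gitignore, "/data.txt")
    ignore_path(gitignore, "/data.txt")  # second call changes nothing
    print(gitignore.read_text())
```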
<h3 id="q-managing-data-and-pipelines-with-dvc-on-hdfs" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/545562334983356426" target="_blank" rel="nofollow noopener noreferrer">Managing data and pipelines with DVC on HDFS</a><a href="#q-managing-data-and-pipelines-with-dvc-on-hdfs" aria-label="q managing data and pipelines with dvc on hdfs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>With DVC, you could connect your data sources on HDFS with the pipeline in
your local project by simply specifying them as external dependencies. For
example, let’s say your script <code>process.cmd</code> works on an input file on HDFS and
then downloads a result to your local workspace. With DVC, it could look
something like:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-d</span> hdfs://example.com/home/shared/input <span class="token punctuation">\</span>
<span class="token parameter variable">-d</span> process.cmd <span class="token punctuation">\</span>
<span class="token parameter variable">-o</span> output process.cmd</span></code></pre></div>
<p><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/545562334983356426" target="_blank" rel="nofollow noopener noreferrer">read more</a>.</p>
<hr>
<p>If you have any questions, concerns or ideas, let us know
<a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a> and our stellar team will get back to you in no
time.</p>https://dvc.org/blog/ml-best-practices-in-pytorch-dev-conf-2018https://dvc.org/blog/ml-best-practices-in-pytorch-dev-conf-2018Thu, 18 Oct 2018 00:00:00 GMT<p>The issues discussed included applying traditional software development
techniques like unit testing, CI/CD systems, automated deployment, version
control, and more to the ML field. In this blog post, we will go over the
best-practice ideas from PTDC-18 and the future of ML tool development.</p>
<h2 id="1-engineering-practices-from-pytorch-developers" style="position:relative;">1. Engineering practices from PyTorch developers<a href="#1-engineering-practices-from-pytorch-developers" aria-label="1 engineering practices from pytorch developers permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In the PTDC-18
<a href="https://www.facebook.com/pytorch/videos/482401942168584/" target="_blank" rel="nofollow noopener noreferrer">keynote speech</a>,
<strong>Jerome Pesenti</strong> described the motivation and goals of the PyTorch project and
what the future of machine learning looks like.</p>
<h3 id="11-ml-tooling-future" style="position:relative;">1.1. ML tooling future<a href="#11-ml-tooling-future" aria-label="11 ml tooling future permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Regarding the future of ML, Jerome envisioned “streamlined development, more
accessible tools, breakthrough hardware, and more”. Talking about the huge
gap between software engineering and ML engineering, Pesenti said:</p>
<blockquote>
<p>Machine learning engineering is where we were in Software Engineering 20 years
ago. A lot of things still need to be invented. We need to figure out what
testing means, what CD (continuous delivery) means, we need to develop tools
and environments that people can develop <strong>robust ML that does not have too
many biases</strong> and does not overfit.</p>
</blockquote>
<p>In that gap live many opportunities to develop new tools and services. We in
the ML ecosystem are called upon to implement the future of machine learning
tools. Traditional software engineering has many useful tools and techniques
which can either be repurposed for Machine Learning development or used as a
source for ideas in developing new tools.</p>
<h3 id="12-pytorch-motivation" style="position:relative;">1.2. PyTorch motivation<a href="#12-pytorch-motivation" aria-label="12 pytorch motivation permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>PyTorch 1.0 implements one important engineering principle: “a seamless
transition from AI research to production”. It helps to move AI technology from
research into production as quickly as possible. To do that, a few
challenges had to be solved:</p>
<ol>
<li>
<p><strong>Write code once</strong> — no need to rewrite or re-optimize code to go from
research to prod.</p>
</li>
<li>
<p><strong>Performance</strong> — training models on large datasets.</p>
</li>
<li>
<p><strong>Other languages</strong> — not only Python, which is great for prototyping, but also
C++ and other languages.</p>
</li>
<li>
<p><strong>Scaling</strong> — deploy PyTorch at scale more easily.</p>
</li>
</ol>
<h2 id="2-engineering-practices-for-software-20" style="position:relative;">2. Engineering practices for software 2.0<a href="#2-engineering-practices-for-software-20" aria-label="2 engineering practices for software 20 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<h3 id="21-melting-of-software-20-and-software-10" style="position:relative;">2.1. Melting of software 2.0 and software 1.0<a href="#21-melting-of-software-20-and-software-10" aria-label="21 melting of software 20 and software 10 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p><strong>Andrej Karpathy</strong> from Tesla AI had a
<a href="https://www.facebook.com/pytorch/videos/169366590639145/" target="_blank" rel="nofollow noopener noreferrer">dedicated talk</a> about
best engineering practices in ML. He contrasted traditional
software development (software 1.0) with software utilizing Machine Learning
techniques (software 2.0), saying that</p>
<blockquote>
<p>“software 2.0 code also has new feature demands, contains bugs, and requires
iterations.”</p>
</blockquote>
<p>Meaning that ML development has a lifecycle similar to traditional software:</p>
<blockquote>
<p>“When you are working with these [neural] networks <strong>in production</strong> you are
doing much more than that [training and measuring models]. You maintaining the
codebase and that codebase is alive is just like 1.0 code.”</p>
</blockquote>
<p>Machine Learning models need to grow and develop feature-by-feature, bugs need
to be found and fixed, and repeatable processes are a must, as in earlier non-ML
software development practices.</p>
<h3 id="22-software-20-best-practices" style="position:relative;">2.2. Software 2.0 best practices<a href="#22-software-20-best-practices" aria-label="22 software 20 best practices permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>Karpathy went on to describe how software 1.0 best practices can be used in
software 2.0 (ML modeling):</p>
<ol>
<li>
<p><strong>Test-driven development</strong> — test/train dataset separation is not enough
since it describes only expected performance. Edge cases have to be tested to
ensure the model performs as required. That requires incorporating more
examples in datasets, or changing model architecture, or changing
optimization functions.</p>
</li>
<li>
<p><strong>Continuous Integration and Continuous Delivery</strong> (CI/CD) — Intelligent use
of CI/CD can propel a team into rapid agile development of software systems.
The phases of CI/CD jobs include: 1) automatic ML model re-training when the code or
dataset changes; 2) running unit tests; 3) easy access to the latest model; 4)
auto-deployment to test and/or production systems.</p>
</li>
<li>
<p><strong>Version Control</strong> — track all the changes in datasets (labels), not only
code.</p>
</li>
<li>
<p>Train a <strong>single model</strong> from scratch every time without using other
pre-trained models. (External pre-trained models don’t count as far as I
understand.) A chain of fine-tuned models very quickly fragments the
codebase. In software 1.0, a single <strong>monorepo</strong> is the analog of a single
model, and it likewise helps to avoid fragmentation.</p>
</li>
</ol>
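<p>The first practice above, testing edge cases in addition to measuring overall test-set performance, can be sketched as a small per-slice check. This is only an illustration: <code>toy_model</code>, the slice names, and the threshold are hypothetical and do not come from the talk.</p>

```python
def accuracy(model, examples):
    """Fraction of (features, label) pairs the model predicts correctly."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)


def check_slices(model, slices, min_accuracy):
    """Return the names of the slices whose accuracy is below the threshold."""
    return [
        name
        for name, examples in slices.items()
        if accuracy(model, examples) < min_accuracy
    ]


# Hypothetical stand-in for a trained model: labels a text as "long"
# when it contains more than three whitespace-separated words.
def toy_model(text):
    return len(text.split()) > 3


# Each slice is a named group of examples; edge cases get their own slices
# so that a regression on them is not hidden by the overall average.
slices = {
    "typical": [("a b c d e", True), ("a b", False)],
    "edge_empty_input": [("", False)],
    "edge_whitespace_only": [("    ", False)],
}

failing = check_slices(toy_model, slices, min_accuracy=1.0)
```

<p>In a CI job, a non-empty <code>failing</code> list would fail the build, which turns an edge case into a regular regression test.</p>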
<p>This list of best practices shows how serious Tesla AI is about robust software,
which is not surprising in the self-driving car domain. Any company needs these
practices in order to organize a manageable ML development process.</p>
<h2 id="3-data-file-centric-tools" style="position:relative;">3. Data file-centric tools<a href="#3-data-file-centric-tools" aria-label="3 data file centric tools permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Frameworks and libraries like PyTorch are a significant step forward in machine
learning tooling and in bringing best practices to the field. However, frameworks and
libraries might not be enough for many of the ML best practices. For example,
dataset versioning, ML model versioning, continuous integration (CI) and
continuous delivery (CD) require manipulating and transferring data files.
These can be done in a <strong>more efficient and natural way by data management
tools</strong> and storage systems rather than libraries.</p>
<p>The need for a machine learning artifact manipulation tool with a <strong>data
file-centric philosophy</strong> was the major motivation behind the open source project
that we created: Data Version Control (DVC), or <a href="http://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC.org</a>.</p>
<p>DVC connects Git with data files and machine learning pipelines which helps keep
version control on machine learning models and datasets using familiar Git
semantics coupled with the power of cloud storage systems such as Amazon’s S3,
Google’s GCS, Microsoft’s Azure or bare-metal servers accessed by SSH.</p>
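<p>The general idea of keeping large data files outside Git while versioning them through Git can be sketched as follows. This is a conceptual illustration under our own assumptions, not DVC's actual implementation or file format: the helper names and the <code>.pointer.json</code> suffix are hypothetical. A small pointer file that records a checksum is what would be committed to Git, while the data itself would live in external storage.</p>

```python
import hashlib
import json
import os
import tempfile


def file_checksum(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fd:
        for chunk in iter(lambda: fd.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path):
    """Write a small JSON pointer file next to the data file (hypothetical format)."""
    pointer = {
        "path": os.path.basename(data_path),
        "sha256": file_checksum(data_path),
        "size": os.path.getsize(data_path),
    }
    pointer_path = data_path + ".pointer.json"
    with open(pointer_path, "w") as fd:
        json.dump(pointer, fd, indent=2)
    return pointer_path


def is_unchanged(data_path, pointer_path):
    """True when the data file still matches the checksum in its pointer."""
    with open(pointer_path) as fd:
        pointer = json.load(fd)
    return pointer["sha256"] == file_checksum(data_path)


# Demo on a temporary file: the pointer detects a change in the dataset.
with tempfile.TemporaryDirectory() as workdir:
    data_file = os.path.join(workdir, "dataset.csv")
    with open(data_file, "w") as fd:
        fd.write("id,label\n1,0\n2,1\n")
    pointer_file = write_pointer(data_file)
    before_edit = is_unchanged(data_file, pointer_file)

    with open(data_file, "a") as fd:
        fd.write("3,1\n")
    after_edit = is_unchanged(data_file, pointer_file)
```

<p>Because the pointer file is tiny and plain text, it diffs and merges like source code, and a changed checksum is a natural trigger for re-training in a CI job.</p>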
<p>If PyTorch helps in organizing code inside an ML project, then data-centric tools
like DVC help organize the different pieces of ML projects into a single workflow.
The machine learning future requires both types of tools: code level and data
file level.</p>
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Thus far only the first steps have been taken toward using machine learning
tooling and the best machine learning practices. Mostly large companies are
using these practices because they faced the problems a while ago. Best
practices should be embraced by the entire industry which will help to bring
machine learning to a new, higher level.</p>https://dvc.org/blog/best-practices-of-orchestrating-python-and-r-code-in-ml-projectshttps://dvc.org/blog/best-practices-of-orchestrating-python-and-r-code-in-ml-projectsTue, 26 Sep 2017 00:00:00 GMT<p>Besides Git and shell scripting, additional tools have been developed to facilitate the
development of predictive models in multi-language environments. For fast data
exchange between R and Python, let’s use the binary data file format
<a href="https://blog.rstudio.com/2016/03/29/feather/" target="_blank" rel="nofollow noopener noreferrer">Feather</a>. Another language-agnostic
tool, <a href="http://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, can make the research reproducible, so let’s
use DVC to orchestrate R and Python code instead of regular shell scripts.</p>
<h2 id="machine-learning-with-r-and-python" style="position:relative;">Machine learning with R and Python<a href="#machine-learning-with-r-and-python" aria-label="machine learning with r and python permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Both R and Python have powerful libraries and packages for predictive
modeling. Algorithms for classification or regression are usually
implemented in both languages; some scientists use R while others
prefer Python. In the example explained in the previous
<a href="https://blog.dataversioncontrol.com/r-code-and-reproducible-model-development-with-dvc-1507a0e3687b" target="_blank" rel="nofollow noopener noreferrer">tutorial</a>,
the target variable was a binary output and logistic regression was used as the training
algorithm. Another algorithm that could be used for prediction is the
popular <a href="https://en.wikipedia.org/wiki/Random_forest" target="_blank" rel="nofollow noopener noreferrer">Random Forest algorithm</a>,
which is implemented in both programming languages. For performance reasons, it
was decided to implement the Random Forest classifier in Python (it
shows better performance than the random forest package in R).</p>
<h2 id="r-example-used-for-dvc-demo" style="position:relative;">R example used for DVC demo<a href="#r-example-used-for-dvc-demo" aria-label="r example used for dvc demo permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We will use the same example from the previous blog
<a href="https://blog.dataversioncontrol.com/r-code-and-reproducible-model-development-with-dvc-1507a0e3687b" target="_blank" rel="nofollow noopener noreferrer">story</a>,
add some Python code and explain how Feather and DVC can simplify the
development process in this combined environment.</p>
<p>Let’s briefly recall the R code from the previous tutorial:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 335px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/68824bc8c4ac0c84edf737da9f1bfa01/31682/r-jobs.png" alt="R Jobs" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>R Jobs</em></p>
<p>The input data is an XML file of StackOverflow posts. Predictive variables are
created from the text of the posts: the relative importance
(<a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf" target="_blank" rel="nofollow noopener noreferrer">tf-idf</a>) of words among all
available posts is calculated. The binary target is then predicted from the
tf-idf matrices using lasso logistic regression. AUC is
calculated on the test set and used as the evaluation metric.</p>
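<p>To make the tf-idf step concrete, here is a minimal sketch of the textbook weighting (term frequency multiplied by the log of the inverse document frequency). It is an illustration only: the R code from the tutorial may use a different tf-idf variant, and the sample posts and the <code>tf_idf</code> helper are made up for this example.</p>

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Made-up posts standing in for the StackOverflow texts.
posts = [
    "how to merge data frames in r",
    "how to merge dicts in python",
    "python list sorting",
]
weights = tf_idf(posts)
```

<p>In the first post, the word <code>r</code> occurs in no other post and therefore gets a higher weight than <code>how</code>, which is shared with the second post. Words that appear in every post would get a weight of zero.</p>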
<p>Instead of using logistic regression in R, we will write Python jobs that
use random forest as the training model. Train_model.R and evaluate.R
will be replaced with the corresponding Python jobs.</p>
<p>The R code can be seen
<a href="https://blog.dataversioncontrol.com/r-code-and-reproducible-model-development-with-dvc-1507a0e3687b" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p>
<p>Code for <code>train_model_Python.py</code> is presented below:</p>
<p></p><div id="gist73527556" class="gist">
<div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container">
<div id="file-train_model_python-py" class="file my-2">
<div itemprop="text" class="Box-body p-0 blob-wrapper data type-python" style="overflow: auto" tabindex="0" role="region" aria-label="train_model_Python.py content, created by Zoldin on 06:52AM on August 02, 2017.">
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="train_model_Python.py">
<tbody><tr>
<td id="file-train_model_python-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
<td id="file-train_model_python-py-LC1" class="blob-code blob-code-inner js-file-line">import numpy as np</td>
</tr>
<tr>
<td id="file-train_model_python-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
<td id="file-train_model_python-py-LC2" class="blob-code blob-code-inner js-file-line">from sklearn.ensemble import RandomForestClassifier</td>
</tr>
<tr>
<td id="file-train_model_python-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
<td id="file-train_model_python-py-LC3" class="blob-code blob-code-inner js-file-line">import sys</td>
</tr>
<tr>
<td id="file-train_model_python-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
<td id="file-train_model_python-py-LC4" class="blob-code blob-code-inner js-file-line">try: import cPickle as pickle # python2</td>
</tr>
<tr>
<td id="file-train_model_python-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
<td id="file-train_model_python-py-LC5" class="blob-code blob-code-inner js-file-line">except ImportError: import pickle # python3</td>
</tr>
<tr>
<td id="file-train_model_python-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
<td id="file-train_model_python-py-LC6" class="blob-code blob-code-inner js-file-line">from scipy import sparse</td>
</tr>
<tr>
<td id="file-train_model_python-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
<td id="file-train_model_python-py-LC7" class="blob-code blob-code-inner js-file-line">from numpy import loadtxt</td>
</tr>
<tr>
<td id="file-train_model_python-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
<td id="file-train_model_python-py-LC8" class="blob-code blob-code-inner js-file-line">import feather as ft</td>
</tr>
<tr>
<td id="file-train_model_python-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
<td id="file-train_model_python-py-LC9" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_model_python-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
<td id="file-train_model_python-py-LC10" class="blob-code blob-code-inner js-file-line">if len(sys.argv) != 4:</td>
</tr>
<tr>
<td id="file-train_model_python-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
<td id="file-train_model_python-py-LC11" class="blob-code blob-code-inner js-file-line"> sys.stderr.write('Arguments error. Usage:\n')</td>
</tr>
<tr>
<td id="file-train_model_python-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
<td id="file-train_model_python-py-LC12" class="blob-code blob-code-inner js-file-line"> sys.stderr.write('\tpython train_model_Python.py INPUT_MATRIX_FILE SEED OUTPUT_MODEL_FILE\n')</td>
</tr>
<tr>
<td id="file-train_model_python-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
<td id="file-train_model_python-py-LC13" class="blob-code blob-code-inner js-file-line"> sys.exit(1)</td>
</tr>
<tr>
<td id="file-train_model_python-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
<td id="file-train_model_python-py-LC14" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_model_python-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
<td id="file-train_model_python-py-LC15" class="blob-code blob-code-inner js-file-line">input = sys.argv[1]</td>
</tr>
<tr>
<td id="file-train_model_python-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
<td id="file-train_model_python-py-LC16" class="blob-code blob-code-inner js-file-line">seed = int(sys.argv[2])</td>
</tr>
<tr>
<td id="file-train_model_python-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
<td id="file-train_model_python-py-LC17" class="blob-code blob-code-inner js-file-line">output = sys.argv[3]</td>
</tr>
<tr>
<td id="file-train_model_python-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
<td id="file-train_model_python-py-LC18" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_model_python-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
<td id="file-train_model_python-py-LC19" class="blob-code blob-code-inner js-file-line">df = ft.read_dataframe(input)</td>
</tr>
<tr>
<td id="file-train_model_python-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
<td id="file-train_model_python-py-LC20" class="blob-code blob-code-inner js-file-line">labels = df.loc[:,'label']</td>
</tr>
<tr>
<td id="file-train_model_python-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
<td id="file-train_model_python-py-LC21" class="blob-code blob-code-inner js-file-line">x = df.loc[:, df.columns != 'label']</td>
</tr>
<tr>
<td id="file-train_model_python-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
<td id="file-train_model_python-py-LC22" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_model_python-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
<td id="file-train_model_python-py-LC23" class="blob-code blob-code-inner js-file-line">clf = RandomForestClassifier(n_estimators=100, n_jobs=2, random_state=seed)</td>
</tr>
<tr>
<td id="file-train_model_python-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
<td id="file-train_model_python-py-LC24" class="blob-code blob-code-inner js-file-line">clf.fit(x, labels) # labels is already a 1-D Series; .ix was removed from pandas</td>
</tr>
<tr>
<td id="file-train_model_python-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
<td id="file-train_model_python-py-LC25" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_model_python-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
<td id="file-train_model_python-py-LC26" class="blob-code blob-code-inner js-file-line">with open(output, 'wb') as fd:</td>
</tr>
<tr>
<td id="file-train_model_python-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
<td id="file-train_model_python-py-LC27" class="blob-code blob-code-inner js-file-line"> pickle.dump(clf, fd)</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
</div>
<div class="gist-meta">
<a href="https://gist.github.com/Zoldin/b312897cc492608feef1eaeae7f6eabc/raw/8dad0f69067945b9b84f8d90a8cdbe52694e36f8/train_model_Python.py" style="float:right" class="Link--inTextBlock">view raw</a>
<a href="https://gist.github.com/Zoldin/b312897cc492608feef1eaeae7f6eabc#file-train_model_python-py" class="Link--inTextBlock">
train_model_Python.py
</a>
hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
</div>
</div>
</div><p></p>
<p>Here is the code for <code>evaluation_python_model.py</code>:</p>
<p></p><div id="gist73527649" class="gist">
<div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container">
<div id="file-evaluation_python_model-py" class="file my-2">
<div itemprop="text" class="Box-body p-0 blob-wrapper data type-python" style="overflow: auto" tabindex="0" role="region" aria-label="evaluation_python_model.py content, created by Zoldin on 06:54AM on August 02, 2017.">
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="evaluation_python_model.py">
<tbody><tr>
<td id="file-evaluation_python_model-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
<td id="file-evaluation_python_model-py-LC1" class="blob-code blob-code-inner js-file-line">from sklearn.metrics import precision_recall_curve</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
<td id="file-evaluation_python_model-py-LC2" class="blob-code blob-code-inner js-file-line">import sys</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
<td id="file-evaluation_python_model-py-LC3" class="blob-code blob-code-inner js-file-line">import sklearn.metrics as metrics</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
<td id="file-evaluation_python_model-py-LC4" class="blob-code blob-code-inner js-file-line">from scipy import sparse</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
<td id="file-evaluation_python_model-py-LC5" class="blob-code blob-code-inner js-file-line">from numpy import loadtxt</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
<td id="file-evaluation_python_model-py-LC6" class="blob-code blob-code-inner js-file-line">try: import cPickle as pickle # python2</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
<td id="file-evaluation_python_model-py-LC7" class="blob-code blob-code-inner js-file-line">except ImportError: import pickle # python3</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
<td id="file-evaluation_python_model-py-LC8" class="blob-code blob-code-inner js-file-line">import feather as ft</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
<td id="file-evaluation_python_model-py-LC9" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
<td id="file-evaluation_python_model-py-LC10" class="blob-code blob-code-inner js-file-line">if len(sys.argv) != 4:</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
<td id="file-evaluation_python_model-py-LC11" class="blob-code blob-code-inner js-file-line"> sys.stderr.write('Arguments error. Usage:\n')</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
<td id="file-evaluation_python_model-py-LC12" class="blob-code blob-code-inner js-file-line"> sys.stderr.write('\tpython evaluation_python_model.py MODEL_FILE TEST_MATRIX METRICS_FILE\n')</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
<td id="file-evaluation_python_model-py-LC13" class="blob-code blob-code-inner js-file-line"> sys.exit(1)</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
<td id="file-evaluation_python_model-py-LC14" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
<td id="file-evaluation_python_model-py-LC15" class="blob-code blob-code-inner js-file-line">model_file = sys.argv[1]</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
<td id="file-evaluation_python_model-py-LC16" class="blob-code blob-code-inner js-file-line">test_matrix_file = sys.argv[2]</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
<td id="file-evaluation_python_model-py-LC17" class="blob-code blob-code-inner js-file-line">metrics_file = sys.argv[3]</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
<td id="file-evaluation_python_model-py-LC18" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
<td id="file-evaluation_python_model-py-LC19" class="blob-code blob-code-inner js-file-line">with open(model_file, 'rb') as fd:</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
<td id="file-evaluation_python_model-py-LC20" class="blob-code blob-code-inner js-file-line"> model = pickle.load(fd)</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
<td id="file-evaluation_python_model-py-LC21" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
<td id="file-evaluation_python_model-py-LC22" class="blob-code blob-code-inner js-file-line">df = ft.read_dataframe(test_matrix_file)</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
<td id="file-evaluation_python_model-py-LC23" class="blob-code blob-code-inner js-file-line">labels = df.loc[:,'label']</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
<td id="file-evaluation_python_model-py-LC24" class="blob-code blob-code-inner js-file-line">x = df.loc[:, df.columns != 'label']</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
<td id="file-evaluation_python_model-py-LC25" class="blob-code blob-code-inner js-file-line">predictions_by_class = model.predict_proba(x)</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
<td id="file-evaluation_python_model-py-LC26" class="blob-code blob-code-inner js-file-line">predictions = predictions_by_class[:,1]</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
<td id="file-evaluation_python_model-py-LC27" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
<td id="file-evaluation_python_model-py-LC28" class="blob-code blob-code-inner js-file-line">precision, recall, thresholds = precision_recall_curve(labels, predictions)</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
<td id="file-evaluation_python_model-py-LC29" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
<td id="file-evaluation_python_model-py-LC30" class="blob-code blob-code-inner js-file-line">auc = metrics.auc(recall, precision)</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
<td id="file-evaluation_python_model-py-LC31" class="blob-code blob-code-inner js-file-line">#print('AUC={}'.format(metrics.auc(recall, precision)))</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
<td id="file-evaluation_python_model-py-LC32" class="blob-code blob-code-inner js-file-line">with open(metrics_file, 'w') as fd:</td>
</tr>
<tr>
<td id="file-evaluation_python_model-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
<td id="file-evaluation_python_model-py-LC33" class="blob-code blob-code-inner js-file-line"> fd.write('AUC: {:4f}\n'.format(auc))</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
</div>
<div class="gist-meta">
<a href="https://gist.github.com/Zoldin/9eef13632d0a9039fe9b0dba376516a4/raw/8b8837f0d5640e0c208ea1c4910d655d933b9bd0/evaluation_python_model.py" style="float:right" class="Link--inTextBlock">view raw</a>
<a href="https://gist.github.com/Zoldin/9eef13632d0a9039fe9b0dba376516a4#file-evaluation_python_model-py" class="Link--inTextBlock">
evaluation_python_model.py
</a>
hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
</div>
</div>
</div><p></p>
<p>Let’s download the R and Python code shown above by cloning the
<a href="https://github.com/Zoldin/R_AND_DVC" target="_blank" rel="nofollow noopener noreferrer">GitHub</a> repository:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">mkdir</span> R_DVC_GITHUB_CODE
</span><span class="token line"><span class="token input">$ </span><span class="token command">cd</span> R_DVC_GITHUB_CODE
</span>
<span class="token line"><span class="token input">$ </span><span class="token git">git clone</span> https://github.com/Zoldin/R_AND_DVC</span></code></pre></div>
<p>The dependency graph of this data science project looks like this:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 250.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/fbd7192868b16c9a421107083e2dd45b/09eb0/our-dependency-graph.png" alt="R (marked red) and Python (marked pink) jobs in one project" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>R
(marked red) and Python (marked pink) jobs in one project</em></p>
<p>Now let’s see how the Feather API and data version control reproducibility
can speed up and simplify the process flow.</p>
<h2 id="feather-api" style="position:relative;">Feather API<a href="#feather-api" aria-label="feather api permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The Feather API is designed to improve metadata and data interchange between R and
Python. It provides fast import and export of data frames between the two environments
and preserves metadata, which is an improvement over exchanging data through
CSV or TXT files. In our example, a Python job reads a binary input file
that was produced in R with the Feather API.</p>
<p>Let’s install the Feather library in both environments.</p>
<p>For Python 3 on Linux, you can install it from the command line with pip3:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">sudo</span> pip3 <span class="token function">install</span> feather-format</span></code></pre></div>
<p>For R, install the feather package:</p>
<div class="gatsby-highlight" data-language="r"><pre class="language-r"><code class="language-r">install.packages<span class="token punctuation">(</span><span class="token string">"feather"</span><span class="token punctuation">)</span></code></pre></div>
<p>After a successful installation, we can use Feather for data exchange.</p>
<p>Below is the R code for exporting data frames with Feather (featurization.R):</p>
<div class="gatsby-highlight" data-language="r"><pre class="language-r"><code class="language-r">library<span class="token punctuation">(</span>feather<span class="token punctuation">)</span>
write_feather<span class="token punctuation">(</span>dtm_train_tfidf<span class="token punctuation">,</span>args<span class="token punctuation">[</span><span class="token number">3</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
write_feather<span class="token punctuation">(</span>dtm_test_tfidf<span class="token punctuation">,</span>args<span class="token punctuation">[</span><span class="token number">4</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
print<span class="token punctuation">(</span><span class="token string">"Two data frames were created with Feather - one for the train and one for the test data set"</span><span class="token punctuation">)</span></code></pre></div>
<p>Python syntax for reading feather input binary files (train_model_python.py):</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> sys
<span class="token keyword">import</span> feather <span class="token keyword">as</span> ft
<span class="token comment"># Path to the Feather file, passed as the first command-line argument</span>
input_file <span class="token operator">=</span> sys<span class="token punctuation">.</span>argv<span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span>
df <span class="token operator">=</span> ft<span class="token punctuation">.</span>read_dataframe<span class="token punctuation">(</span>input_file<span class="token punctuation">)</span></code></pre></div>
<h2 id="dependency-graph-with-r-and-python-combined" style="position:relative;">Dependency graph with R and Python combined<a href="#dependency-graph-with-r-and-python-combined" aria-label="dependency graph with r and python combined permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The next question is: why do we need DVC instead of plain shell scripting?
DVC automatically derives the dependencies between the
steps and builds
<a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph" target="_blank" rel="nofollow noopener noreferrer">the dependency graph (DAG)</a>
transparently to the user. The graph is used to reproduce only the parts of your
pipeline that were affected by recent changes, so we don’t have to keep track of
which steps need to be repeated.</p>
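<p>To make the idea concrete, here is a minimal Python sketch of how a dependency graph narrows a re-run to the affected steps. It is not DVC’s implementation: the <code>STAGES</code> table, the stage names, and the <code>stages_to_rerun</code> helper are hypothetical, and the sketch assumes the stages are listed in pipeline order.</p>

```python
# Hypothetical stage table mirroring the R/Python pipeline in this post.
# Each stage lists the files it reads ("deps") and the files it writes ("outs").
STAGES = {
    "parse": {
        "deps": ["code/parsingxml.R", "data/Posts.xml"],
        "outs": ["data/Posts.csv"],
    },
    "split": {
        "deps": ["code/train_test_spliting.R", "data/Posts.csv"],
        "outs": ["data/train_post.csv", "data/test_post.csv"],
    },
    "featurize": {
        "deps": ["code/featurization.R", "data/train_post.csv", "data/test_post.csv"],
        "outs": ["data/matrix_train.feather", "data/matrix_test.feather"],
    },
    "train": {
        "deps": ["code/train_model_python.py", "data/matrix_train.feather"],
        "outs": ["data/model.p"],
    },
    "evaluate": {
        "deps": ["code/evaluate_python_mdl.py", "data/model.p", "data/matrix_test.feather"],
        "outs": ["data/evaluation_python.txt"],
    },
}


def stages_to_rerun(changed_path, stages=STAGES):
    """Return the names of the stages affected by a change, in execution order.

    Assumes `stages` is listed in pipeline order, so a single pass is enough:
    once a stage is affected, all of its outputs are treated as changed too.
    """
    dirty = {changed_path}
    affected = []
    for name, stage in stages.items():
        if dirty.intersection(stage["deps"]):
            affected.append(name)
            dirty.update(stage["outs"])
    return affected


print(stages_to_rerun("code/train_model_python.py"))
print(stages_to_rerun("data/train_post.csv"))
```

<p>In this sketch, editing the training script affects only the training and evaluation stages, while changing the train/test split also pulls in featurization.</p>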
<p>First, we use the <code>dvc run</code> command to execute all jobs related to our
model development. In this phase DVC records the dependencies that will be used in
the reproducibility phase:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc import</span> https://s3-us-west-2.amazonaws.com/dvc-public/data/tutorial/nlp/25K/Posts.xml.zip <span class="token punctuation">\</span>
data/
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token function">tar</span> zxf data/Posts.xml.tgz <span class="token parameter variable">-C</span> data/
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/parsingxml.R <span class="token punctuation">\</span>
data/Posts.xml data/Posts.csv
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/train_test_spliting.R <span class="token punctuation">\</span>
data/Posts.csv <span class="token number">0.33</span> <span class="token number">20170426</span> <span class="token punctuation">\</span>
data/train_post.csv data/test_post.csv
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/featurization.R <span class="token punctuation">\</span>
data/train_post.csv <span class="token punctuation">\</span>
data/test_post.csv data/matrix_train.feather <span class="token punctuation">\</span>
data/matrix_test.feather
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> python3 code/train_model_python.py <span class="token punctuation">\</span>
data/matrix_train.feather <span class="token punctuation">\</span>
<span class="token number">20170426</span> data/model.p
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> python3 code/evaluate_python_mdl.py <span class="token punctuation">\</span>
data/model.p data/matrix_test.feather <span class="token punctuation">\</span>
data/evaluation_python.txt</span></code></pre></div>
<p>After these commands, the jobs have been executed and included in the DAG. The result (the AUC
metric) is written to the evaluation_python.txt file:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> data/evaluation_python.txt
</span>AUC: 0.741432</code></pre></div>
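<p>The evaluation script computes this metric as the area under the precision-recall curve. The pure-Python sketch below illustrates that calculation on a toy example. It is not scikit-learn’s implementation: the helper names <code>precision_recall_points</code> and <code>trapezoid_auc</code> are hypothetical, and edge cases may be handled differently by the library.</p>

```python
def precision_recall_points(labels, scores):
    """Return (recall, precision) points, one per distinct score threshold.

    `labels` are 0/1 ground-truth values and `scores` are predicted
    probabilities for the positive class.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    # Walk through the predictions from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 1.0)]  # conventional start: recall 0, precision 1
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every prediction that shares the current threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


def trapezoid_auc(points):
    """Area under a curve given as (x, y) points, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data: a perfect ranking puts every positive above every negative.
labels = [1, 1, 0, 0]
scores = [0.9, 0.8, 0.2, 0.1]
auc = trapezoid_auc(precision_recall_points(labels, scores))
print("AUC: {:.6f}".format(auc))
```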
<p>We can try to improve this result by tuning the random forest algorithm.</p>
<p>For example, we can increase the number of trees in the random forest classifier from 100 to
500:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">clf <span class="token operator">=</span> RandomForestClassifier<span class="token punctuation">(</span>n_estimators<span class="token operator">=</span><span class="token number">500</span><span class="token punctuation">,</span>
n_jobs<span class="token operator">=</span><span class="token number">2</span><span class="token punctuation">,</span>
random_state<span class="token operator">=</span>seed<span class="token punctuation">)</span>
clf<span class="token punctuation">.</span>fit<span class="token punctuation">(</span>x<span class="token punctuation">,</span> labels<span class="token punctuation">)</span></code></pre></div>
<p>After committing the changes (in <code>train_model_python.py</code>), the <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> command re-executes all
jobs needed to reproduce <code>evaluation_python.txt</code>. We
don’t need to worry about which jobs to run or in which order.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> <span class="token builtin class-name">.</span>
</span><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span>
</span>[master a65f346] Random forest classifier — more trees added
1 file changed, 1 insertion(+), 1 deletion(-)
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> data/evaluation_python.txt
</span>
Reproducing run command for data item data/model.p. Args: python3 code/train_model_python.py data/matrix_train.txt 20170426 data/model.p
Reproducing run command for data item data/evaluation_python.txt. Args: python3 code/evaluate_python_mdl.py data/model.p data/matrix_test.txt data/evaluation_python.txt
Data item “data/evaluation_python.txt” was reproduced.</code></pre></div>
<p>Besides code versioning, DVC also takes care of data versioning. For example, if we
change the data sets <code>train_post.csv</code> and <code>test_post.csv</code> (by using a different splitting
ratio), DVC will know that the data sets have changed and <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> will re-execute
all jobs needed for evaluation_python.txt.</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/train_test_spliting.R <span class="token punctuation">\</span>
data/Posts.csv <span class="token number">0.15</span> <span class="token number">20170426</span> <span class="token punctuation">\</span>
data/train_post.csv <span class="token punctuation">\</span>
data/test_post.csv</span></code></pre></div>
<p>Re-executed jobs are marked in red:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 250.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/10053d985ed8b13cfb9b560ee5d2cc37/09eb0/re-executed-jobs.png" alt="re executed jobs" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/train_test_spliting.R <span class="token punctuation">\</span>
data/Posts.csv <span class="token number">0.15</span> <span class="token number">20170426</span> <span class="token punctuation">\</span>
data/train_post.csv <span class="token punctuation">\</span>
data/test_post.csv
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> data/evaluation_python.txt
</span>
Reproducing run command for data item data/matrix_train.txt. Args: Rscript --vanilla code/featurization.R data/train_post.csv data/test_post.csv data/matrix_train.txt data/matrix_test.txt
Reproducing run command for data item data/model.p. Args: python3 code/train_model_python.py data/matrix_train.txt 20170426 data/model.p
Reproducing run command for data item data/evaluation_python.txt. Args: python3 code/evaluate_python_mdl.py data/model.p data/matrix_test.txt data/evaluation_python.txt
Data item “data/evaluation_python.txt” was reproduced.
<span class="token line"><span class="token input">$ </span><span class="token command">cat</span> data/evaluation_python.txt
</span>AUC: 0.793145</code></pre></div>
<p>The new AUC result is 0.793145, which is an improvement over the previous
iteration.</p>
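<p>Detecting that a data set has changed comes down to comparing a recorded fingerprint of the file with its current one. The sketch below shows that idea with a content hash. It is a simplified illustration rather than DVC’s code: the choice of SHA-256, the in-memory <code>recorded</code> dictionary, and the helper names are all assumptions made for this example.</p>

```python
import hashlib
import os
import tempfile


def file_hash(path, chunk_size=1 << 20):
    """Hash a file's content in chunks so large data sets fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(recorded):
    """Return the paths whose current hash differs from the recorded one."""
    return [path for path, old in recorded.items() if file_hash(path) != old]


# Demo on a temporary file standing in for data/train_post.csv.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_post.csv")
    with open(path, "w") as fh:
        fh.write("id,label\n1,0\n")

    recorded = {path: file_hash(path)}  # fingerprint taken after the first run
    unchanged = changed_files(recorded)  # nothing has been edited yet

    with open(path, "a") as fh:  # simulate a new train/test split
        fh.write("2,1\n")
    changed = changed_files(recorded)

print(unchanged)
print([os.path.basename(p) for p in changed])
```

<p>Any stage that depends on a file reported as changed would then be scheduled for re-execution.</p>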
<h2 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Data science projects often combine R and Python code.
Tools beyond Git and shell scripting have been developed to ease the
development of predictive models in multi-language environments. Using a data
version control system for reproducibility and Feather for data interoperability
helps you orchestrate R and Python code in a single environment.</p>https://dvc.org/blog/ml-model-ensembling-with-fast-iterationshttps://dvc.org/blog/ml-model-ensembling-with-fast-iterationsWed, 23 Aug 2017 00:00:00 GMT<p>In a model ensembling setup, the final prediction is a composite of predictions
from individual machine learning algorithms. To make the best model composite,
you have to try dozens of combinations of weights for the model set. It takes a
lot of time to come up with the best one. That is why iteration speed is
crucial in ML model ensembling. We are going to make our research
reproducible by using the <a href="http://dvc.org" target="_blank" rel="nofollow noopener noreferrer">Data Version Control</a> tool
(<a href="http://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a>). It provides the ability to quickly re-run and replicate
the ML prediction result by executing a single command, <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>.</p>
<p>As we will demonstrate, DVC helps tackle common technical
challenges of building pipelines for ensemble learning.</p>
<h2 id="project-overview" style="position:relative;">Project Overview<a href="#project-overview" aria-label="project overview permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In this case, we will build an R-based solution to a
supervised-learning regression problem: predicting wine sales for the
<a href="https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/" target="_blank" rel="nofollow noopener noreferrer">Predict Wine Sales</a>
Kaggle competition.</p>
<p>The project uses an ensemble prediction methodology. A weighted
ensemble of three models (Linear Regression, <code>GBM</code>, and <code>XGBoost</code>)
will be implemented, trained, and used for prediction.</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 435px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/eb9050a712d4a3f7fd006686b1f41fe2/39600/ensemble-prediction-methodology.png" alt="ensemble prediction methodology" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>If properly designed and used, ensemble prediction can perform much better than
the predictions of the individual machine learning models composing the ensemble.</p>
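<p>As an illustration of the weighting step, here is a minimal Python sketch of a weighted average over per-model predictions. The project itself is written in R; the function name <code>weighted_ensemble</code>, the sample predictions, and the weights below are hypothetical and only show the arithmetic.</p>

```python
def weighted_ensemble(predictions, weights):
    """Combine per-model predictions into one weighted-average prediction.

    `predictions` is a list with one list of predicted values per model;
    `weights` holds one non-negative weight per model. The weights are
    normalized, so they do not have to sum to 1.
    """
    if len(predictions) != len(weights):
        raise ValueError("one weight is required per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    n_rows = len(predictions[0])
    return [
        sum(w * model[i] for model, w in zip(predictions, weights)) / total
        for i in range(n_rows)
    ]


# Made-up predictions from three models for two rows of the test set.
lr_pred = [2.0, 4.0]
gbm_pred = [4.0, 8.0]
xgb_pred = [3.0, 6.0]

combined = weighted_ensemble([lr_pred, gbm_pred, xgb_pred], [1, 1, 2])
print(combined)
```

<p>Trying a different weight combination only means changing the weight list and recomputing, which is why fast iteration matters for ensembling.</p>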
<p>Prediction results will be delivered as an output CSV file in the format
specified by the requirements of the
<a href="https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/" target="_blank" rel="nofollow noopener noreferrer">Predict Wine Sales</a>
Kaggle competition (the so-called Kaggle submission file).</p>
<h2 id="important-pre-requisites" style="position:relative;">Important Pre-Requisites<a href="#important-pre-requisites" aria-label="important pre requisites permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In order to try the materials of this
<a href="https://github.com/gvyshnya/DVC_R_Ensemble" target="_blank" rel="nofollow noopener noreferrer">repository</a> in your environment,
the following software should be installed on your machine:</p>
<ul>
<li>
<p><strong><em>Python 3</em></strong> runtime environment for your OS (it is required to run DVC
commands in the batch files)</p>
</li>
<li>
<p><strong><em>DVC</em></strong> itself (you can install it as a Python package by running
<code>pip install dvc</code> at your command prompt)</p>
</li>
<li>
<p><strong><em>R</em></strong> <strong><em>3.4.x</em></strong> runtime environment for your OS</p>
</li>
<li>
<p><strong><em>git</em></strong> command-line client application for your OS</p>
</li>
</ul>
<h2 id="technical-challenges" style="position:relative;">Technical Challenges<a href="#technical-challenges" aria-label="technical challenges permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The technical challenges of building the ML pipeline for this project were to
meet the business requirements below:</p>
<ul>
<li>
<p>Ability to conditionally trigger execution of 3 different ML prediction models</p>
</li>
<li>
<p>Ability to conditionally trigger model ensemble prediction based on
predictions of those 3 individual models</p>
</li>
<li>
<p>Ability to specify weights of each of the individual model predictions in the
ensemble</p>
</li>
<li>
<p>Fast redeployment and re-run of the ML pipeline upon frequent
reconfiguration and model tweaks</p>
</li>
<li>
<p>Reproducibility of the pipeline and forecasting results across multiple
machines and team members</p>
</li>
</ul>
<p>The sections below explain how these challenges are addressed in the
design of the ML pipeline for this project.</p>
<h2 id="ml-pipeline" style="position:relative;">ML Pipeline<a href="#ml-pipeline" aria-label="ml pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The ML pipeline for this project is presented in the diagram below:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 365.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9cf20fd774b97331a5c6e17a1e92115b/39600/ml-pipeline.png" alt="ml pipeline" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>As you can see, the essential implementation of the solution is as follows:</p>
<ul>
<li>
<p><a href="https://gist.github.com/gvyshnya/443424775b0150baac774cc6cf3cb1cc" target="_blank" rel="nofollow noopener noreferrer"><code>preprocessing.R</code></a>
handles all aspects of data manipulation and pre-processing (reading training
and testing data sets, removing outliers, imputing NAs, etc.) and stores the
refined training and testing data as new files for reuse by the model scripts</p>
</li>
<li>
<p>3 model scripts implement training and forecasting algorithms for each of the
models selected for this project
(<a href="https://gist.github.com/gvyshnya/7ec76316c24bc1b4f595ef1256f52d3a" target="_blank" rel="nofollow noopener noreferrer"><code>LR.R</code></a>,
<a href="https://gist.github.com/gvyshnya/50e5ea3efa9771d2e7cc121c2f1a04e4" target="_blank" rel="nofollow noopener noreferrer"><code>GBM.R</code></a>,
<a href="https://gist.github.com/gvyshnya/2e5799863f02fec652c194020da82dd3" target="_blank" rel="nofollow noopener noreferrer"><code>xgboost.R</code></a>)</p>
</li>
<li>
<p><a href="https://gist.github.com/gvyshnya/84379d6a68fd085fe3a26aabad453e55" target="_blank" rel="nofollow noopener noreferrer"><code>ensemble.R</code></a>
is responsible for the weighted ensemble prediction and the final output of
the Kaggle submission file</p>
</li>
<li>
<p><code>config.R</code> is responsible for all of the conditional logic switches needed in
the pipeline (to achieve this, it is sourced by all of the modeling and ensemble
prediction scripts)</p>
</li>
</ul>
<p>A special note on the lack of feature engineering in this project: it
was a deliberate decision based on the specifics of the dataset. The
existing features were sufficient to predict the target values ‘as is’.
Therefore, we decided to follow the well-known
<a href="https://en.wikipedia.org/wiki/Pareto_principle" target="_blank" rel="nofollow noopener noreferrer">Pareto principle</a> (interpreted
as “<strong><em>20% of efforts address 80% of issues</em></strong>”, in this case) and not to spend
more time on it.</p>
<p><strong><em>Note</em></strong>: all <code>R</code> and batch files mentioned throughout this blog post are
available online in a separate GitHub
<a href="https://github.com/gvyshnya/DVC_R_Ensemble" target="_blank" rel="nofollow noopener noreferrer">repository</a>. You can also
review more details on the implementation of each of the machine learning
prediction models there.</p>
<h3 id="pipeline-configuration-management" style="position:relative;">Pipeline Configuration Management<a href="#pipeline-configuration-management" aria-label="pipeline configuration management permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>All of the essential tweaks to the conditional machine learning pipeline for this
project are managed through a configuration file. For ease of use across the solution,
it is implemented as an R code module (<code>config.R</code>) that is included in all model
training and forecasting scripts. The runnable scripts read the respective parameters
(assigned as R variables) and trigger their conditional logic
accordingly.</p>
<p>This file is not intended to run from a command line (unlike the rest of the R
scripts in the project).</p>
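<p>The switch pattern can be sketched in Python for illustration (the project’s real configuration is the R module). The <code>CONFIG</code> dictionary and the <code>run_step</code> helper are hypothetical; only the flag names echo the <code>cfg_run_*</code> variables of <code>config.R</code>.</p>

```python
# Hypothetical flags echoing cfg_run_LR, cfg_run_GBM, cfg_run_xgboost and
# cfg_run_ensemble: 1 means "run this step", 0 means "skip it".
CONFIG = {
    "run_LR": 1,
    "run_GBM": 1,
    "run_xgboost": 0,
    "run_ensemble": 1,
}


def run_step(flag_name, step, config=CONFIG):
    """Call `step` only when its flag is switched on; otherwise skip it."""
    if config.get(flag_name, 0) == 1:
        return step()
    return None


# Each lambda stands in for a real model-fitting script.
results = {
    "LR": run_step("run_LR", lambda: "LR fitted"),
    "GBM": run_step("run_GBM", lambda: "GBM fitted"),
    "xgboost": run_step("run_xgboost", lambda: "xgboost fitted"),
}
print(results)
```

<p>Keeping the switches in one shared module means every script sees the same settings, and flipping a flag is a single tracked change to the configuration file.</p>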
<p></p><div id="gist73938264" class="gist">
<div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container">
<div id="file-config-r" class="file my-2">
<div itemprop="text" class="Box-body p-0 blob-wrapper data type-r" style="overflow: auto" tabindex="0" role="region" aria-label="config.R content, created by gvyshnya on 03:27PM on August 06, 2017.">
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="config.R">
<tbody><tr>
<td id="file-config-r-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
<td id="file-config-r-LC1" class="blob-code blob-code-inner js-file-line"># Competition: https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/</td>
</tr>
<tr>
<td id="file-config-r-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
<td id="file-config-r-LC2" class="blob-code blob-code-inner js-file-line"># This is a configuration file to the entire solution </td>
</tr>
<tr>
<td id="file-config-r-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
<td id="file-config-r-LC3" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-config-r-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
<td id="file-config-r-LC4" class="blob-code blob-code-inner js-file-line"># LR.R specific settings</td>
</tr>
<tr>
<td id="file-config-r-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
<td id="file-config-r-LC5" class="blob-code blob-code-inner js-file-line">cfg_run_LR <- 1 # if set to 0, LR model will not fit, and its prediction will not be calculated in the batch mode</td>
</tr>
<tr>
<td id="file-config-r-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
<td id="file-config-r-LC6" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-config-r-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
<td id="file-config-r-LC7" class="blob-code blob-code-inner js-file-line"># GBM.R specific settings</td>
</tr>
<tr>
<td id="file-config-r-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
<td id="file-config-r-LC8" class="blob-code blob-code-inner js-file-line">cfg_run_GBM <- 1 # if set to 0, GBM model will not fit, and its prediction will not be calculated in the batch mode</td>
</tr>
<tr>
<td id="file-config-r-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
<td id="file-config-r-LC9" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-config-r-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
<td id="file-config-r-LC10" class="blob-code blob-code-inner js-file-line"># xgboost.R specific settings</td>
</tr>
<tr>
<td id="file-config-r-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
<td id="file-config-r-LC11" class="blob-code blob-code-inner js-file-line">cfg_run_xgboost <- 1 # if set to 0, xgboost model will not fit, and its prediction will not be calculated in the batch mode</td>
</tr>
<tr>
<td id="file-config-r-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
<td id="file-config-r-LC12" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-config-r-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
<td id="file-config-r-LC13" class="blob-code blob-code-inner js-file-line"># ensemble.R specific settings</td>
</tr>
<tr>
<td id="file-config-r-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
<td id="file-config-r-LC14" class="blob-code blob-code-inner js-file-line">cfg_run_ensemble <- 1 # if set to 0, the ensemble will not predict, and ensemble prediction will not be created</td>
</tr>
<tr>
<td id="file-config-r-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
<td id="file-config-r-LC15" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-config-r-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
<td id="file-config-r-LC16" class="blob-code blob-code-inner js-file-line"># ensemble components</td>
</tr>
<tr>
<td id="file-config-r-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
<td id="file-config-r-LC17" class="blob-code blob-code-inner js-file-line">cfg_model_predictions <- c("data/submission_LR.csv", "data/submission_GBM.csv", "data/submission_XGBOOST.csv")</td>
</tr>
<tr>
<td id="file-config-r-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
<td id="file-config-r-LC18" class="blob-code blob-code-inner js-file-line"># element weights mapped to the cfg_model_predictions elements above</td>
</tr>
<tr>
<td id="file-config-r-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
<td id="file-config-r-LC19" class="blob-code blob-code-inner js-file-line">cfg_model_weights <- c(1,1,1) # weights of predictions of the models in the ensemble</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
</div>
<div class="gist-meta">
<a href="https://gist.github.com/gvyshnya/918e94b06ebf222f6bb56ed26a5f44ee/raw/e274919657607fdfd67a2fb6354e40ff0c4173e9/config.R" style="float:right" class="Link--inTextBlock">view raw</a>
<a href="https://gist.github.com/gvyshnya/918e94b06ebf222f6bb56ed26a5f44ee#file-config-r" class="Link--inTextBlock">
config.R
</a>
hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
</div>
</div>
</div><p></p>
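<p>To make the role of <code>cfg_model_predictions</code> and <code>cfg_model_weights</code> concrete, here is a minimal sketch of a weighted-average ensemble. It is written in Python rather than R purely for illustration, and it assumes the ensemble is a weighted mean of the per-model predictions; the actual logic of <code>ensemble.R</code> is not shown in this post and may differ. The function name <code>weighted_ensemble</code> and the sample numbers are hypothetical.</p>

```python
from __future__ import annotations

from typing import Sequence


def weighted_ensemble(
    predictions: Sequence[Sequence[float]], weights: Sequence[float]
) -> list[float]:
    """Combine per-model prediction vectors into one weighted mean per row.

    Assumption: the ensemble is a plain weighted average. This mirrors the
    shape of cfg_model_predictions / cfg_model_weights, not ensemble.R itself.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    n_rows = len(predictions[0])
    if any(len(p) != n_rows for p in predictions):
        raise ValueError("all models must predict the same number of rows")
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]


# Made-up predictions from three models for two test rows.
lr, gbm, xgb = [3.0, 4.0], [5.0, 4.0], [4.0, 7.0]

equal = weighted_ensemble([lr, gbm, xgb], [1, 1, 1])      # like c(1,1,1)
gbm_heavy = weighted_ensemble([lr, gbm, xgb], [1, 2, 1])  # GBM counts double
print(equal, gbm_heavy)
```

<p>With equal weights every model contributes the same share; raising the weight of one model pulls the ensemble toward that model's predictions.</p>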
<h3 id="why-do-we-need-dvc" style="position:relative;">Why Do We Need DVC?<a href="#why-do-we-need-dvc" aria-label="why do we need dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>As we all know, no ML model delivers sound prediction accuracy on the first
attempt. You have to keep adjusting your algorithm and model implementations
based on cross-validation results until the predictions are good enough. This
is especially true in ensemble learning, where you constantly tweak not only
the parameters of the individual prediction models but also the settings of the
ensemble itself:</p>
<ul>
<li>
<p>changing ensemble composition — adding or removing individual prediction
models</p>
</li>
<li>
<p>changing model prediction weights in the resulting ensemble prediction</p>
</li>
</ul>
<p>In this situation, DVC helps you manage your ensemble ML pipeline reliably.
Consider the following real-world scenario:</p>
<ul>
<li>
<p>A team member changes the settings of the <code>GBM</code> model and resubmits its
implementation to the repository (this is emulated by the commit
<a href="https://github.com/gvyshnya/DVC_R_Ensemble/commit/27825d0732f72f07e7e4e48548ddb8a8604103f0" target="_blank" rel="nofollow noopener noreferrer">#8604103f0</a>,
checksum <code>27825d0</code>)</p>
</li>
<li>
<p>You rerun the entire ML pipeline on your computer, to get the newest
predictions from <code>GBM</code> as well as the updated final ensemble prediction</p>
</li>
<li>
<p>The prediction results are still not optimal, so someone changes the weights
of the individual models in the ensemble, giving <code>GBM</code> a higher weight
than <code>xgboost</code> and <code>LR</code></p>
</li>
<li>
<p>After the ensemble setup changes are committed (and the updated <code>config.R</code> appears in
the repository, as emulated by the commit
<a href="https://github.com/gvyshnya/DVC_R_Ensemble/commit/5bcbe115afcb24886abb4734ff2da42eb97612ce" target="_blank" rel="nofollow noopener noreferrer">#eb97612ce</a>,
checksum <code>5bcbe11</code>), you re-run the model predictions and the final ensemble
prediction on your machine once again</p>
</li>
</ul>
<p>All you need to do to handle the changes above is keep running your
<strong>DVC</strong> commands per the script developed (see the section below). You do
not have to remember or explicitly know what changed in the project codebase or
its pipeline configuration. <strong>DVC</strong> will automatically check out the
latest changes from the repo and run only those pipeline steps that were
affected by the recent changes in the code modules.</p>
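<p>The idea of running only the affected steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DVC's actual implementation: the stage list, the <code>stale_stages</code> helper, and the use of in-memory byte strings instead of real files are all hypothetical, and the stages are assumed to be listed in execution order.</p>

```python
import hashlib


def digest(content: bytes) -> str:
    """Content checksum used to notice that a file changed."""
    return hashlib.md5(content).hexdigest()


def stale_stages(stages, current, recorded):
    """Return the stages to re-run, given (name, deps, outs) tuples.

    `current` maps path -> file content, `recorded` maps path -> checksum
    saved after the previous run. Stages must be in execution order.
    """
    dirty = {path for path, data in current.items()
             if recorded.get(path) != digest(data)}
    rerun = []
    for name, deps, outs in stages:
        if dirty.intersection(deps):
            rerun.append(name)
            dirty.update(outs)  # outputs of a re-run stage are stale too
    return rerun


# Simplified, hypothetical view of the pipeline from this post.
STAGES = [
    ("preprocess", ["code/preprocessing.R", "data/wine.csv"],
     ["data/training_imputed.csv"]),
    ("LR", ["code/LR.R", "code/config.R", "data/training_imputed.csv"],
     ["data/submission_LR.csv"]),
    ("GBM", ["code/GBM.R", "code/config.R", "data/training_imputed.csv"],
     ["data/submission_GBM.csv"]),
    ("ensemble", ["code/ensemble.R", "code/config.R",
                  "data/submission_LR.csv", "data/submission_GBM.csv"],
     ["data/submission_ensemble.csv"]),
]

files = {
    "code/preprocessing.R": b"v1", "data/wine.csv": b"rows",
    "code/LR.R": b"v1", "code/GBM.R": b"v1",
    "code/ensemble.R": b"v1", "code/config.R": b"weights 1,1",
}
recorded = {path: digest(data) for path, data in files.items()}

files["code/GBM.R"] = b"v2"  # a team member tunes GBM
after_gbm_change = stale_stages(STAGES, files, recorded)

recorded = {path: digest(data) for path, data in files.items()}
files["code/config.R"] = b"weights 1,2"  # someone changes ensemble weights
after_config_change = stale_stages(STAGES, files, recorded)
print(after_gbm_change, after_config_change)
```

<p>In this sketch a change to <code>code/GBM.R</code> re-runs only the GBM and ensemble stages, while a change to <code>code/config.R</code> re-runs every stage that takes it as an input.</p>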
<h3 id="orchestrating-the-pipeline--dvc-command-file" style="position:relative;">Orchestrating the Pipeline : DVC Command File<a href="#orchestrating-the-pipeline--dvc-command-file" aria-label="orchestrating the pipeline dvc command file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h3>
<p>After developing the individual R scripts needed by the different steps of our
machine learning pipeline, we orchestrate them together using DVC.</p>
<p>Below is a batch file illustrating how DVC manages the steps of the machine
learning process for this project:</p>
<p></p><div id="gist73940214" class="gist">
<div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container">
<div id="file-dvc-bat" class="file my-2">
<div itemprop="text" class="Box-body p-0 blob-wrapper data type-batchfile" style="overflow: auto" tabindex="0" role="region" aria-label="dvc.bat content, created by gvyshnya on 04:05PM on August 06, 2017.">
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="dvc.bat">
<tbody><tr>
<td id="file-dvc-bat-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
<td id="file-dvc-bat-LC1" class="blob-code blob-code-inner js-file-line"># This is a DVC-based script to manage machine-learning pipeline for a project per</td>
</tr>
<tr>
<td id="file-dvc-bat-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
<td id="file-dvc-bat-LC2" class="blob-code blob-code-inner js-file-line"># https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/</td>
</tr>
<tr>
<td id="file-dvc-bat-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
<td id="file-dvc-bat-LC3" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc-bat-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
<td id="file-dvc-bat-LC4" class="blob-code blob-code-inner js-file-line">mkdir R_DVC_GITHUB_CODE</td>
</tr>
<tr>
<td id="file-dvc-bat-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
<td id="file-dvc-bat-LC5" class="blob-code blob-code-inner js-file-line">cd R_DVC_GITHUB_CODE</td>
</tr>
<tr>
<td id="file-dvc-bat-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
<td id="file-dvc-bat-LC6" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc-bat-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
<td id="file-dvc-bat-LC7" class="blob-code blob-code-inner js-file-line"># clone the github repo with the code</td>
</tr>
<tr>
<td id="file-dvc-bat-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
<td id="file-dvc-bat-LC8" class="blob-code blob-code-inner js-file-line">git clone https://github.com/gvyshnya/DVC_R_Ensemble</td>
</tr>
<tr>
<td id="file-dvc-bat-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
<td id="file-dvc-bat-LC9" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc-bat-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
<td id="file-dvc-bat-LC10" class="blob-code blob-code-inner js-file-line"># initialize DVC</td>
</tr>
<tr>
<td id="file-dvc-bat-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
<td id="file-dvc-bat-LC11" class="blob-code blob-code-inner js-file-line">$ dvc init</td>
</tr>
<tr>
<td id="file-dvc-bat-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
<td id="file-dvc-bat-LC12" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc-bat-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
<td id="file-dvc-bat-LC13" class="blob-code blob-code-inner js-file-line"># import data</td>
</tr>
<tr>
<td id="file-dvc-bat-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
<td id="file-dvc-bat-LC14" class="blob-code blob-code-inner js-file-line">$ dvc import https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/download/wine.csv data/</td>
</tr>
<tr>
<td id="file-dvc-bat-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
<td id="file-dvc-bat-LC15" class="blob-code blob-code-inner js-file-line">$ dvc import https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/download/wine_test.csv data/</td>
</tr>
<tr>
<td id="file-dvc-bat-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
<td id="file-dvc-bat-LC16" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc-bat-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
<td id="file-dvc-bat-LC17" class="blob-code blob-code-inner js-file-line"># run data pre-processing</td>
</tr>
<tr>
<td id="file-dvc-bat-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
<td id="file-dvc-bat-LC18" class="blob-code blob-code-inner js-file-line">$ dvc run Rscript --vanilla code/preprocessing.R data/wine.csv data/wine_test.csv data/training_imputed.csv data/testing_imputed.csv</td>
</tr>
<tr>
<td id="file-dvc-bat-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
<td id="file-dvc-bat-LC19" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc-bat-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
<td id="file-dvc-bat-LC20" class="blob-code blob-code-inner js-file-line"># run LR model fit and forecasting</td>
</tr>
<tr>
<td id="file-dvc-bat-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
<td id="file-dvc-bat-LC21" class="blob-code blob-code-inner js-file-line">$ dvc run Rscript --vanilla code/LR.R data/training_imputed.csv data/testing_imputed.csv 0.7 825 data/submission_LR.csv code/config.R</td>
</tr>
<tr>
<td id="file-dvc-bat-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
<td id="file-dvc-bat-LC22" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc-bat-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
<td id="file-dvc-bat-LC23" class="blob-code blob-code-inner js-file-line"># run GBM model fit and forecasting</td>
</tr>
<tr>
<td id="file-dvc-bat-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
<td id="file-dvc-bat-LC24" class="blob-code blob-code-inner js-file-line">$ dvc run Rscript --vanilla code/GBM.R data/training_imputed.csv data/testing_imputed.csv 5000 10 4 25 data/submission_GBM.csv code/config.R</td>
</tr>
<tr>
<td id="file-dvc-bat-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
<td id="file-dvc-bat-LC25" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc-bat-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
<td id="file-dvc-bat-LC26" class="blob-code blob-code-inner js-file-line"># run XGBOOST model fit and forecasting</td>
</tr>
<tr>
<td id="file-dvc-bat-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
<td id="file-dvc-bat-LC27" class="blob-code blob-code-inner js-file-line">$ dvc run Rscript --vanilla code/xgboost.R data/training_imputed.csv data/testing_imputed.csv 1000 10 0.0001 1.0 data/submission_xgboost.csv code/config.R</td>
</tr>
<tr>
<td id="file-dvc-bat-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
<td id="file-dvc-bat-LC28" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc-bat-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
<td id="file-dvc-bat-LC29" class="blob-code blob-code-inner js-file-line"># prepare ensemble submission</td>
</tr>
<tr>
<td id="file-dvc-bat-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
<td id="file-dvc-bat-LC30" class="blob-code blob-code-inner js-file-line"># Note: please make sure to edit your code/config.R to set up the references to the predictions from each model according</td>
</tr>
<tr>
<td id="file-dvc-bat-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
<td id="file-dvc-bat-LC31" class="blob-code blob-code-inner js-file-line"># to the names of output files on the steps above</td>
</tr>
<tr>
<td id="file-dvc-bat-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
<td id="file-dvc-bat-LC32" class="blob-code blob-code-inner js-file-line">$ dvc run Rscript --vanilla code/ensemble.R data/submission_ensemble.csv code/config.R</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
</div>
<div class="gist-meta">
<a href="https://gist.github.com/gvyshnya/7f1b8262e3eb7a8b3c16dbfd8cf98644/raw/4818eab6c2f99722110a37c7d2c509c78ce4240a/dvc.bat" style="float:right" class="Link--inTextBlock">view raw</a>
<a href="https://gist.github.com/gvyshnya/7f1b8262e3eb7a8b3c16dbfd8cf98644#file-dvc-bat" class="Link--inTextBlock">
dvc.bat
</a>
hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
</div>
</div>
</div><p></p>
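<p>The order of the commands in the batch file follows from which files each step reads and writes. As a hedged illustration, the Python sketch below derives a valid execution order from such input and output lists using the standard library's <code>graphlib</code> (Python 3.9+). The stage names, the <code>STAGES</code> dictionary, and the <code>stage_order</code> helper are assumptions made for this example; DVC builds its own graph internally.</p>

```python
from __future__ import annotations

from graphlib import TopologicalSorter

# stage -> (inputs, outputs); file names follow dvc.bat, the layout is made up.
# The stages are deliberately listed out of order.
STAGES = {
    "ensemble": (
        ["data/submission_LR.csv", "data/submission_GBM.csv",
         "data/submission_xgboost.csv"],
        ["data/submission_ensemble.csv"],
    ),
    "xgboost": (
        ["data/training_imputed.csv", "data/testing_imputed.csv"],
        ["data/submission_xgboost.csv"],
    ),
    "GBM": (
        ["data/training_imputed.csv", "data/testing_imputed.csv"],
        ["data/submission_GBM.csv"],
    ),
    "LR": (
        ["data/training_imputed.csv", "data/testing_imputed.csv"],
        ["data/submission_LR.csv"],
    ),
    "preprocess": (
        ["data/wine.csv", "data/wine_test.csv"],
        ["data/training_imputed.csv", "data/testing_imputed.csv"],
    ),
}


def stage_order(stages: dict[str, tuple[list[str], list[str]]]) -> list[str]:
    """Order stages so that every stage runs after the producers of its inputs."""
    producer = {out: name for name, (_, outs) in stages.items() for out in outs}
    # Map each stage to the set of stages it depends on (its predecessors).
    graph = {
        name: {producer[dep] for dep in deps if dep in producer}
        for name, (deps, _) in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = stage_order(STAGES)
print(order)
```

<p>Pre-processing always comes first and the ensemble last; the three model stages do not depend on each other, so their relative order is free.</p>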
<p>If you later edit the ensemble configuration in <code>code/config.R</code>, you
can rely on DVC's automatic dependency resolution and tracking to rebuild the
ensemble prediction as follows:</p>
<p></p><div id="gist74997297" class="gist">
<div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container">
<div id="file-dvc-repro-code" class="file my-2">
<div itemprop="text" class="Box-body p-0 blob-wrapper data type-text" style="overflow: auto" tabindex="0" role="region" aria-label="dvc repro code content, created by gvyshnya on 07:22PM on August 20, 2017.">
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="dvc repro code">
<tbody><tr>
<td id="file-dvc-repro-code-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
<td id="file-dvc-repro-code-LC1" class="blob-code blob-code-inner js-file-line"># Improve ensemble configuration</td>
</tr>
<tr>
<td id="file-dvc-repro-code-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
<td id="file-dvc-repro-code-LC2" class="blob-code blob-code-inner js-file-line">$ vi code/config.R</td>
</tr>
<tr>
<td id="file-dvc-repro-code-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
<td id="file-dvc-repro-code-LC3" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc-repro-code-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
<td id="file-dvc-repro-code-LC4" class="blob-code blob-code-inner js-file-line"># Commit all the changes.</td>
</tr>
<tr>
<td id="file-dvc-repro-code-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
<td id="file-dvc-repro-code-LC5" class="blob-code blob-code-inner js-file-line">$ git commit -am "Updated weights of the models in the ensemble"</td>
</tr>
<tr>
<td id="file-dvc-repro-code-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
<td id="file-dvc-repro-code-LC6" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc-repro-code-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
<td id="file-dvc-repro-code-LC7" class="blob-code blob-code-inner js-file-line"># Reproduce the ensemble prediction</td>
</tr>
<tr>
<td id="file-dvc-repro-code-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
<td id="file-dvc-repro-code-LC8" class="blob-code blob-code-inner js-file-line">$ dvc repro data/submission_ensemble.csv</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
</div>
<div class="gist-meta">
<a href="https://gist.github.com/gvyshnya/9d80e51ba3d7aa5bd37d100ed82376ee/raw/4367adacf7f6d78ad223289c52737588441fabcb/dvc%20repro%20code" style="float:right" class="Link--inTextBlock">view raw</a>
<a href="https://gist.github.com/gvyshnya/9d80e51ba3d7aa5bd37d100ed82376ee#file-dvc-repro-code" class="Link--inTextBlock">
dvc repro code
</a>
hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
</div>
</div>
</div><p></p>
<h2 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In this blog post, we worked through the process of building an ensemble
prediction pipeline using DVC. The key features of that pipeline were as
follows:</p>
<ul>
<li>
<p><strong><em>reproducibility</em></strong>: everybody on the team can run it on their own machine</p>
</li>
<li>
<p><strong><em>separation of data and code</em></strong>: this ensures everyone always runs the
latest versions of the pipeline jobs with the most up-to-date ‘golden copy’ of
the training and testing data sets</p>
</li>
</ul>
<p>A helpful side effect of using DVC is that you no longer have to keep in mind
what changed at every step of modifying your project scripts or the pipeline
configuration. Because DVC maintains the dependency graph (DAG) automatically,
it triggers only the steps affected by a particular change within the pipeline
job setup. This, in turn, lets you iterate quickly through the entire ML
pipeline.</p>
<blockquote>
<p>DVC brings proven engineering practices to often suboptimal and messy ML
processes and helps a typical data science project team eliminate a
big chunk of common
<a href="https://blog.dataversioncontrol.com/data-version-control-in-analytics-devops-paradigm-35a880e99133" target="_blank" rel="nofollow noopener noreferrer">DevOps overheads</a>,
so I found it extremely useful on industrial data science and
predictive analytics projects.</p>
</blockquote>
<h2 id="further-reading" style="position:relative;">Further Reading<a href="#further-reading" aria-label="further reading permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<ol>
<li>
<p><a href="https://en.wikipedia.org/wiki/Ensemble_learning" target="_blank" rel="nofollow noopener noreferrer">Ensemble Learning and Prediction Introduction</a></p>
</li>
<li>
<p><a href="https://blog.dataversioncontrol.com/data-version-control-beta-release-iterative-machine-learning-a7faf7c8be67" target="_blank" rel="nofollow noopener noreferrer">Using DVC in Machine Learning projects in Python</a></p>
</li>
<li>
<p><a href="https://blog.dataversioncontrol.com/r-code-and-reproducible-model-development-with-dvc-1507a0e3687b" target="_blank" rel="nofollow noopener noreferrer">Using DVC in Machine Learning projects in R</a></p>
</li>
<li>
<p><a href="https://mlwave.com/kaggle-ensembling-guide/" target="_blank" rel="nofollow noopener noreferrer">Kaggle Ensembling Guide</a></p>
</li>
</ol>https://dvc.org/blog/data-version-control-in-analytics-devops-paradigmhttps://dvc.org/blog/data-version-control-in-analytics-devops-paradigmThu, 27 Jul 2017 00:00:00 GMT<h2 id="data-science-and-devops-convergence" style="position:relative;">Data Science and DevOps Convergence<a href="#data-science-and-devops-convergence" aria-label="data science and devops convergence permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The primary mission of DevOps is to help teams resolve various Tech Ops
infrastructure, tooling, and pipeline issues.</p>
<p>On the other hand, as mentioned in the conceptual review by
<a href="https://www.forbes.com/sites/teradata/2016/11/14/devops-for-data-science-why-analytics-ops-is-key-to-value/" target="_blank" rel="nofollow noopener noreferrer">Forbes</a>
in November 2016, industrial analytics is no longer going to be driven by data
scientists alone. It requires an investment in DevOps skills, practices and
supporting technology to move analytics out of the lab and into the business.
There are even
<a href="https://www.computing.co.uk/ctg/news/2433095/a-lot-of-companies-will-stop-hiring-data-scientists-when-they-realise-that-the-majority-bring-no-value-says-data-scientist" target="_blank" rel="nofollow noopener noreferrer">voices</a>
calling on Data Scientists to concentrate on agile methodology and DevOps if they
want to keep their jobs in business in the long run.</p>
<h2 id="why-devops-matters" style="position:relative;">Why DevOps Matters<a href="#why-devops-matters" aria-label="why devops matters permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The eternal dream of almost every Data Scientist today is to spend all (well,
almost all) the time in the office exploring new datasets, engineering decisive
new features, inventing and validating cool new algorithms and strategies.
However, reality is often different. One of the unfortunate daily routines of a
Data Scientist's work is raw data pre-processing. It usually translates into
the following challenges:</p>
<ol>
<li>
<p><strong>Pull all kinds of necessary data from a variety of sources</strong></p>
<ul>
<li>
<p>Internal data sources like ERP, CRM, POS systems, or data from online
e-commerce platforms</p>
</li>
<li>
<p>External data, like weather, public holidays, Google trends etc.</p>
</li>
</ul>
</li>
<li>
<p><strong>Extract, transform, and load the data</strong></p>
<ul>
<li>
<p>Relate and join the data sources</p>
</li>
<li>
<p>Aggregate and transform the data</p>
</li>
</ul>
</li>
<li>
<p><strong>Avoid technical and performance drawbacks</strong> when everything ends up in
“one big table” at the end</p>
</li>
<li>
<p><strong>Facilitate continuous machine learning and decision-making in a
business-ready framework</strong></p>
<ul>
<li>
<p>Utilize historic data to train the machine learning models and algorithms</p>
</li>
<li>
<p>Use the current, up-to-date data for decision-making</p>
</li>
<li>
<p>Export back the resulting decisions/recommendations to review by business
stakeholders, either back into the ERP system or some other data warehouse</p>
</li>
</ul>
</li>
</ol>
<p>Another big challenge is to organize <strong>collaboration and data/model sharing</strong>
inside and across the boundaries of teams of Data Scientists and Software
Engineers.</p>
<p>DevOps skills as well as effective instruments will certainly be beneficial for
industrial Data Scientists as they can address the above-mentioned challenges in
a self-service manner.</p>
<h2 id="can-dvc-be-a-solution" style="position:relative;">Can DVC Be a Solution?<a href="#can-dvc-be-a-solution" aria-label="can dvc be a solution permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p><a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">Data Version Control</a> or simply DVC comes to the scene
whenever you start looking for effective DevOps-for-Analytics instruments.</p>
<p>DVC is an open source tool for data science projects. It makes your data science
projects reproducible by automatically building a data dependency graph (DAG).
Your code and its dependencies can be easily shared through Git, and your data
through cloud storage (AWS S3, GCP), all in a single DVC environment.</p>
<blockquote>
<p>Although DVC was created for machine learning developers and data scientists
<a href="https://dvc.org/doc/understanding-dvc/what-is-dvc" target="_blank" rel="nofollow noopener noreferrer">originally</a>, it turned out
to be useful beyond that audience. Since it brings proven engineering practices
to a poorly defined ML process, I found it to have enormous potential as an
Analytical DevOps instrument.</p>
</blockquote>
<p>It clearly helps to manage a large fraction of the DevOps issues in a Data
Scientist's daily routine:</p>
<ol>
<li>
<p><strong>Pull all kinds of necessary data from a variety of sources</strong>. Once you
configure and script your data extraction jobs with DVC, they will be
persistent and operable across your data and service infrastructure.</p>
</li>
<li>
<p><strong>Extract, transform, and load the data</strong>. ETL becomes easy and
repeatable once you configure it with DVC scripting. It turns into a solid
pipeline that operates without major support effort. Moreover, DVC tracks
all changes and uses the DAG to flag which pipeline steps need to be rerun.</p>
</li>
<li>
<p><strong>Facilitate continuous machine learning and decision-making.</strong> The part of
the pipeline facilitated through DVC scripting can be jobs to upload data
back to any transactional system (like ERP, ERM, CRM etc.), warehouse or data
mart. It will then be exposed to business stakeholders to make intelligent
data-driven decisions.</p>
</li>
<li>
<p><strong>Share your algorithms and data</strong>. Machine Learning modeling is an iterative
process, and it is extremely important to keep track of your steps, the
dependencies between the steps, the dependencies between your code and data
files, and all the arguments your code runs with. This becomes even more
important and complicated in a team environment, where collaboration between
data scientists takes a serious amount of the team’s effort. DVC is there to
help you with it.</p>
</li>
</ol>
<p>One of the ‘juicy’ features of DVC is its ability to support multiple technology
stacks. Whether you prefer R or use promising Python-based implementations for
your industrial data products, DVC will be able to support your pipeline
properly. You can see it in action for both
<a href="https://blog.dvc.org/how-data-scientists-can-improve-their-productivity" target="_blank" rel="nofollow noopener noreferrer">Python-based</a>
and
<a href="https://blog.dvc.org/r-code-and-reproducible-model-development-with-dvc" target="_blank" rel="nofollow noopener noreferrer">R-based</a>
technical stacks.</p>
<p>As such, DVC is going to be one of the tools you would enjoy using if/when you
embark on building a continual analytical environment for your system or across
your organization.</p>
<h2 id="continual-analytical-environment-and-devops" style="position:relative;">Continual Analytical Environment and DevOps<a href="#continual-analytical-environment-and-devops" aria-label="continual analytical environment and devops permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Building a production pipeline is quite different from building a
machine-learning prototype on a local laptop. Many teams and companies face
challenges there.</p>
<p>At a bare minimum, the following requirements must be met when you move your
solution into production:</p>
<ol>
<li>
<p>Periodic re-training of the models/algorithms</p>
</li>
<li>
<p>Ease of re-deployment and configuration changes in the system</p>
</li>
<li>
<p>Efficient, high-performance real-time scoring of new out-of-sample
observations</p>
</li>
<li>
<p>Ability to monitor model performance over time</p>
</li>
<li>
<p>Adaptive ETL and ability to manage new data feeds and transactional systems
as data sources for AI and machine learning tools</p>
</li>
<li>
<p>Scaling to really big data operations</p>
</li>
<li>
<p>Security and authorized access levels for different areas of the analytical
systems</p>
</li>
<li>
<p>Solid backup and recovery processes/tools</p>
</li>
</ol>
<p>This goes into the territory traditionally inhabited by DevOps. Data Scientists
should ideally learn to handle some of those requirements themselves, or at
least be informed consultants to classical DevOps gurus.</p>
<p>DVC can help in many aspects of the production scenario above, as it can
orchestrate relevant tools and instruments through its scripting. In such a
setup, DVC scripts become a shareable manifestation (and implementation) of your
production pipeline, where each step can be transparently reviewed, easily
maintained, and changed as needed over time.</p>
<h2 id="will-devops-be-captivating" style="position:relative;">Will DevOps Be Captivating?<a href="#will-devops-be-captivating" aria-label="will devops be captivating permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>If you are interested in better understanding the ever-growing role of
DevOps in modern Data Science and predictive analytics in business, here
are some good resources for your review:</p>
<ol>
<li>
<p><a href="https://www.forbes.com/sites/teradata/2016/11/14/devops-for-data-science-why-analytics-ops-is-key-to-value/" target="_blank" rel="nofollow noopener noreferrer">DevOps For Data Science: Why Analytics Ops Is Key To Value</a>
(Forbes, Nov 14, 2016)</p>
</li>
<li>
<p><a href="https://www.packtpub.com/books/content/bridging-gap-between-data-science-and-devops" target="_blank" rel="nofollow noopener noreferrer">Bridging the Gap Between Data Science and DevOps</a></p>
</li>
<li>
<p><a href="https://devops.com/devops-life-better-data-scientists/" target="_blank" rel="nofollow noopener noreferrer">Is DevOps Making Life Better for Data Scientists?</a></p>
</li>
</ol>
<p>By any measure, DVC is going to be a useful instrument to fill the multiple gaps
between the classical in-lab old-school data science practices and growing
demands of business to build solid DevOps processes and workflows to streamline
mature and persistent data analytics.</p>https://dvc.org/blog/r-code-and-reproducible-model-development-with-dvchttps://dvc.org/blog/r-code-and-reproducible-model-development-with-dvcMon, 24 Jul 2017 00:00:00 GMT<p><a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, or Data Version Control, is a tool that tracks
file and data dependencies during model development in order to facilitate
reproducibility and to version data files. Most of the
<a href="https://dvc.org/doc/tutorials" target="_blank" rel="nofollow noopener noreferrer">DVC tutorials</a> provide good examples of using
DVC with Python. However, I realized that DVC is a
<a href="https://en.wikipedia.org/wiki/Language-agnostic" target="_blank" rel="nofollow noopener noreferrer">language agnostic</a> tool and
can be used with any programming language. In this blog post, we will see how to
use DVC in R projects.</p>
<h2 id="r-coding--keep-it-simple-and-readable" style="position:relative;">R coding — keep it simple and readable<a href="#r-coding--keep-it-simple-and-readable" aria-label="r coding keep it simple and readable permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Every model development effort is a combination of the steps presented below:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 342px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/921db/development-steps.png" alt="Model development process" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span>
<em>Model development process</em></p>
<p>Because the process is iterative by nature, it is very
important to improve some coding and organizational skills. For example, instead
of having one big R file with code, it is better to split the code into several
logical files, each responsible for one small piece of work. It is smart to track
the development history with the
<a href="https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control" target="_blank" rel="nofollow noopener noreferrer">git</a>
tool. Writing “<em>reusable code</em>” is a nice skill to have. Putting comments in the
code can make our lives easier.</p>
<p>Besides git, the next step toward further improvement is to try out DVC.
Whenever a change is made to the code or the data sets, DVC
can reproduce the results with just one shell command on Linux (or
Windows). It records the dependencies among files and scripts, so it can
rerun all the necessary steps in the right order without us having to worry
about it.</p>
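<p>One common way to notice that a script or data file has changed is to compare
content hashes between runs. The Python sketch below illustrates that idea
only; the <code>changed_files</code> helper and the JSON state file are
hypothetical, and DVC's own bookkeeping is more involved.</p>

```python
import hashlib
import json
from pathlib import Path


def file_md5(path):
    """Hash a file's contents in chunks so large data files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(paths, state_file):
    """Return the paths whose hash differs from the saved state, then save the new state."""
    state_path = Path(state_file)
    previous = json.loads(state_path.read_text()) if state_path.exists() else {}
    current = {str(path): file_md5(path) for path in paths}
    state_path.write_text(json.dumps(current, indent=2))
    return sorted(path for path, md5 in current.items() if previous.get(path) != md5)
```

<p>On the first call every file counts as changed, because no state has been
saved yet. On later calls only the files whose contents differ are returned,
which is the signal a pipeline tool needs to decide what to rerun.</p>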
<h2 id="r-example--data-and-code-clarification" style="position:relative;">R example — data and code clarification<a href="#r-example--data-and-code-clarification" aria-label="r example data and code clarification permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>We’ll take a Python example from the
<a href="https://dvc.org/doc/tutorials/deep" target="_blank" rel="nofollow noopener noreferrer">DVC tutorial</a> (written by Dmitry Petrov)
and rewrite that code in R. Using this example, we’ll show how DVC can help
during development and what it can do.</p>
<p>First, let’s initialize git and dvc for this example and run our code for
the first time. After that, we will simulate some changes in the code and see
how DVC handles reproducibility.</p>
<p>The R code can be downloaded from the
<a href="https://github.com/Zoldin/R_AND_DVC" target="_blank" rel="nofollow noopener noreferrer">GitHub repository</a>. A brief explanation of
the scripts is presented below:</p>
<p><strong>parsingxml.R</strong>: takes the XML file that we downloaded from the web and
creates the corresponding CSV file.</p>
<p></p><div id="gist71114089" class="gist">
<div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container">
<div id="file-parsingxml-r" class="file my-2">
<div itemprop="text" class="Box-body p-0 blob-wrapper data type-r" style="overflow: auto" tabindex="0" role="region" aria-label="parsingxml.R content, created by Zoldin on 08:40PM on July 21, 2017.">
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="parsingxml.R">
<tbody><tr>
<td id="file-parsingxml-r-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
<td id="file-parsingxml-r-LC1" class="blob-code blob-code-inner js-file-line">#!/usr/bin/Rscript</td>
</tr>
<tr>
<td id="file-parsingxml-r-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
<td id="file-parsingxml-r-LC2" class="blob-code blob-code-inner js-file-line">library(XML)</td>
</tr>
<tr>
<td id="file-parsingxml-r-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
<td id="file-parsingxml-r-LC3" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-parsingxml-r-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
<td id="file-parsingxml-r-LC4" class="blob-code blob-code-inner js-file-line">args = commandArgs(trailingOnly=TRUE)</td>
</tr>
<tr>
<td id="file-parsingxml-r-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
<td id="file-parsingxml-r-LC5" class="blob-code blob-code-inner js-file-line">if (!length(args)==2) {</td>
</tr>
<tr>
<td id="file-parsingxml-r-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
<td id="file-parsingxml-r-LC6" class="blob-code blob-code-inner js-file-line"> stop("Two arguments must be supplied (input file name, output file name - csv ext).\n", call.=FALSE)</td>
</tr>
<tr>
<td id="file-parsingxml-r-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
<td id="file-parsingxml-r-LC7" class="blob-code blob-code-inner js-file-line">} </td>
</tr>
<tr>
<td id="file-parsingxml-r-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
<td id="file-parsingxml-r-LC8" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-parsingxml-r-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
<td id="file-parsingxml-r-LC9" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-parsingxml-r-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
<td id="file-parsingxml-r-LC10" class="blob-code blob-code-inner js-file-line">#read XML line by line</td>
</tr>
<tr>
<td id="file-parsingxml-r-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
<td id="file-parsingxml-r-LC11" class="blob-code blob-code-inner js-file-line">con <- file(args[1], "r")</td>
</tr>
<tr>
<td id="file-parsingxml-r-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
<td id="file-parsingxml-r-LC12" class="blob-code blob-code-inner js-file-line">lines <- readLines(con, -1)</td>
</tr>
<tr>
<td id="file-parsingxml-r-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
<td id="file-parsingxml-r-LC13" class="blob-code blob-code-inner js-file-line">test <- lapply(lines,function(x){return(xmlTreeParse(x,useInternalNodes = TRUE))})</td>
</tr>
<tr>
<td id="file-parsingxml-r-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
<td id="file-parsingxml-r-LC14" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-parsingxml-r-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
<td id="file-parsingxml-r-LC15" class="blob-code blob-code-inner js-file-line">#parsing XML to get variables</td>
</tr>
<tr>
<td id="file-parsingxml-r-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
<td id="file-parsingxml-r-LC16" class="blob-code blob-code-inner js-file-line">ID <- as.numeric(sapply(test,function(x){return(xpathSApply(x, "//row",xmlGetAttr, "Id"))}))</td>
</tr>
<tr>
<td id="file-parsingxml-r-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
<td id="file-parsingxml-r-LC17" class="blob-code blob-code-inner js-file-line">Tags <- sapply(test,function(x){return(xpathSApply(x, "//row",xmlGetAttr, "Tags"))})</td>
</tr>
<tr>
<td id="file-parsingxml-r-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
<td id="file-parsingxml-r-LC18" class="blob-code blob-code-inner js-file-line">Title <- as.character(sapply(test,function(x){return(xpathSApply(x, "//row",xmlGetAttr, "Title"))}))</td>
</tr>
<tr>
<td id="file-parsingxml-r-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
<td id="file-parsingxml-r-LC19" class="blob-code blob-code-inner js-file-line">Body <- as.character(sapply(test,function(x){return(xpathSApply(x, "//row",xmlGetAttr, "Body"))}))</td>
</tr>
<tr>
<td id="file-parsingxml-r-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
<td id="file-parsingxml-r-LC20" class="blob-code blob-code-inner js-file-line">text = paste(Title,Body)</td>
</tr>
<tr>
<td id="file-parsingxml-r-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
<td id="file-parsingxml-r-LC21" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-parsingxml-r-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
<td id="file-parsingxml-r-LC22" class="blob-code blob-code-inner js-file-line">label = as.numeric(sapply(Tags,function(x){return(grep("python",x))}))</td>
</tr>
<tr>
<td id="file-parsingxml-r-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
<td id="file-parsingxml-r-LC23" class="blob-code blob-code-inner js-file-line">label[is.na(label)]=0</td>
</tr>
<tr>
<td id="file-parsingxml-r-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
<td id="file-parsingxml-r-LC24" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-parsingxml-r-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
<td id="file-parsingxml-r-LC25" class="blob-code blob-code-inner js-file-line">#final data frame for export</td>
</tr>
<tr>
<td id="file-parsingxml-r-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
<td id="file-parsingxml-r-LC26" class="blob-code blob-code-inner js-file-line">df <- as.data.frame(cbind(ID,label,text),stringsAsFactors = FALSE)</td>
</tr>
<tr>
<td id="file-parsingxml-r-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
<td id="file-parsingxml-r-LC27" class="blob-code blob-code-inner js-file-line">df$ID=as.numeric(df$ID)</td>
</tr>
<tr>
<td id="file-parsingxml-r-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
<td id="file-parsingxml-r-LC28" class="blob-code blob-code-inner js-file-line">df$label=as.numeric(df$label)</td>
</tr>
<tr>
<td id="file-parsingxml-r-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
<td id="file-parsingxml-r-LC29" class="blob-code blob-code-inner js-file-line">#write to csv</td>
</tr>
<tr>
<td id="file-parsingxml-r-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
<td id="file-parsingxml-r-LC30" class="blob-code blob-code-inner js-file-line">write.csv(df, file=args[2],row.names=FALSE)</td>
</tr>
<tr>
<td id="file-parsingxml-r-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
<td id="file-parsingxml-r-LC31" class="blob-code blob-code-inner js-file-line">print("output file created....")</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
</div>
<div class="gist-meta">
<a href="https://gist.github.com/Zoldin/47536af63182a0e8daf37a7b989e2e8d/raw/98b259ade11132ad87e9c4f476b7561b184cf041/parsingxml.R" style="float:right" class="Link--inTextBlock">view raw</a>
<a href="https://gist.github.com/Zoldin/47536af63182a0e8daf37a7b989e2e8d#file-parsingxml-r" class="Link--inTextBlock">
parsingxml.R
</a>
hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
</div>
</div>
</div><p></p>
<p><strong>train_test_splitting.R</strong>: stratified sampling by the target variable (this is
where we create the train and test data sets).</p>
<p></p><div id="gist71114469" class="gist">
<div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container">
<div id="file-train_test_splitting-r" class="file my-2">
<div itemprop="text" class="Box-body p-0 blob-wrapper data type-r" style="overflow: auto" tabindex="0" role="region" aria-label="train_test_splitting.R content, created by Zoldin on 08:42PM on July 21, 2017.">
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="train_test_splitting.R">
<tbody><tr>
<td id="file-train_test_splitting-r-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
<td id="file-train_test_splitting-r-LC1" class="blob-code blob-code-inner js-file-line">#!/usr/bin/Rscript</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
<td id="file-train_test_splitting-r-LC2" class="blob-code blob-code-inner js-file-line">library(caret)</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
<td id="file-train_test_splitting-r-LC3" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
<td id="file-train_test_splitting-r-LC4" class="blob-code blob-code-inner js-file-line">args = commandArgs(trailingOnly=TRUE)</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
<td id="file-train_test_splitting-r-LC5" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
<td id="file-train_test_splitting-r-LC6" class="blob-code blob-code-inner js-file-line">if (!length(args)==5) {</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
<td id="file-train_test_splitting-r-LC7" class="blob-code blob-code-inner js-file-line"> stop("Five arguments must be supplied (input file name, splitting ratio related to test data set, seed, train output file name, test output file name).\n", call.=FALSE)</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
<td id="file-train_test_splitting-r-LC8" class="blob-code blob-code-inner js-file-line">} </td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
<td id="file-train_test_splitting-r-LC9" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
<td id="file-train_test_splitting-r-LC10" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
<td id="file-train_test_splitting-r-LC11" class="blob-code blob-code-inner js-file-line">set.seed(as.numeric(args[3]))</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
<td id="file-train_test_splitting-r-LC12" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
<td id="file-train_test_splitting-r-LC13" class="blob-code blob-code-inner js-file-line">df <- read.csv(args[1],stringsAsFactors = FALSE)</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
<td id="file-train_test_splitting-r-LC14" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
<td id="file-train_test_splitting-r-LC15" class="blob-code blob-code-inner js-file-line">test.index <- createDataPartition(df$label, p = as.numeric(args[2]), list = FALSE)</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
<td id="file-train_test_splitting-r-LC16" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
<td id="file-train_test_splitting-r-LC17" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
<td id="file-train_test_splitting-r-LC18" class="blob-code blob-code-inner js-file-line">train <- df[-test.index,]</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
<td id="file-train_test_splitting-r-LC19" class="blob-code blob-code-inner js-file-line">test <- df[test.index,]</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
<td id="file-train_test_splitting-r-LC20" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
<td id="file-train_test_splitting-r-LC21" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
<td id="file-train_test_splitting-r-LC22" class="blob-code blob-code-inner js-file-line">write.csv(train, file=args[4],row.names=FALSE)</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
<td id="file-train_test_splitting-r-LC23" class="blob-code blob-code-inner js-file-line">write.csv(test, file=args[5],row.names=FALSE)</td>
</tr>
<tr>
<td id="file-train_test_splitting-r-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
<td id="file-train_test_splitting-r-LC24" class="blob-code blob-code-inner js-file-line">print("train/test files created....")</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
</div>
<div class="gist-meta">
<a href="https://gist.github.com/Zoldin/7591c47ce5988cbe087e0038c9a850b9/raw/e2106c39bad8a4ae04e41658bd287ea94ff7437a/train_test_splitting.R" style="float:right" class="Link--inTextBlock">view raw</a>
<a href="https://gist.github.com/Zoldin/7591c47ce5988cbe087e0038c9a850b9#file-train_test_splitting-r" class="Link--inTextBlock">
train_test_splitting.R
</a>
hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
</div>
</div>
</div><p></p>
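<p>For readers less familiar with stratified sampling, here is a minimal sketch of
the idea in Python, used here purely for illustration. It is not a port of
caret's <code>createDataPartition</code>; the <code>stratified_split</code>
function and its parameters are hypothetical. Each label value is sampled
separately, so the test set keeps roughly the same class proportions as the full
data set.</p>

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio, seed):
    """Return (train_indices, test_indices), sampling each label value separately."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    test = []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        test.extend(indices[: round(len(indices) * test_ratio)])

    test_set = set(test)
    train = [i for i in range(len(labels)) if i not in test_set]
    return train, sorted(test)
```

<p>With 80 rows labelled 0 and 20 rows labelled 1, a test ratio of 0.2 puts 16
rows of the first class and 4 of the second into the test set, and the same seed
always produces the same split.</p>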
<p><strong>featurization.R</strong>: text mining and tf-idf matrix creation. This step
creates the predictor variables.</p>
<p></p><div id="gist71113907" class="gist">
<div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container">
<div id="file-featurization-r" class="file my-2">
<div itemprop="text" class="Box-body p-0 blob-wrapper data type-r" style="overflow: auto" tabindex="0" role="region" aria-label="featurization.R content, created by Zoldin on 08:39PM on July 21, 2017.">
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="featurization.R">
<tbody><tr>
<td id="file-featurization-r-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
<td id="file-featurization-r-LC1" class="blob-code blob-code-inner js-file-line">#!/usr/bin/Rscript</td>
</tr>
<tr>
<td id="file-featurization-r-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
<td id="file-featurization-r-LC2" class="blob-code blob-code-inner js-file-line">library(text2vec)</td>
</tr>
<tr>
<td id="file-featurization-r-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
<td id="file-featurization-r-LC3" class="blob-code blob-code-inner js-file-line">library(MASS)</td>
</tr>
<tr>
<td id="file-featurization-r-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
<td id="file-featurization-r-LC4" class="blob-code blob-code-inner js-file-line">library(Matrix)</td>
</tr>
<tr>
<td id="file-featurization-r-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
<td id="file-featurization-r-LC5" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-featurization-r-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
<td id="file-featurization-r-LC6" class="blob-code blob-code-inner js-file-line">args = commandArgs(trailingOnly=TRUE)</td>
</tr>
<tr>
<td id="file-featurization-r-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
<td id="file-featurization-r-LC7" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-featurization-r-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
<td id="file-featurization-r-LC8" class="blob-code blob-code-inner js-file-line">if (!length(args)==4) {</td>
</tr>
<tr>
<td id="file-featurization-r-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
<td id="file-featurization-r-LC9" class="blob-code blob-code-inner js-file-line"> stop("Four arguments must be supplied: train file (csv format), test data set (csv format), train output file name and test output file name (txt files).\n", call.=FALSE)</td>
</tr>
<tr>
<td id="file-featurization-r-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
<td id="file-featurization-r-LC10" class="blob-code blob-code-inner js-file-line">} </td>
</tr>
<tr>
<td id="file-featurization-r-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
<td id="file-featurization-r-LC11" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-featurization-r-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
<td id="file-featurization-r-LC12" class="blob-code blob-code-inner js-file-line">#read input files</td>
</tr>
<tr>
<td id="file-featurization-r-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
<td id="file-featurization-r-LC13" class="blob-code blob-code-inner js-file-line">df_train = read.csv(args[1],stringsAsFactors = FALSE)</td>
</tr>
<tr>
<td id="file-featurization-r-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
<td id="file-featurization-r-LC14" class="blob-code blob-code-inner js-file-line">df_test = read.csv(args[2],stringsAsFactors = FALSE)</td>
</tr>
<tr>
<td id="file-featurization-r-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
<td id="file-featurization-r-LC15" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-featurization-r-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
<td id="file-featurization-r-LC16" class="blob-code blob-code-inner js-file-line">#create vocabulary - words</td>
</tr>
<tr>
<td id="file-featurization-r-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
<td id="file-featurization-r-LC17" class="blob-code blob-code-inner js-file-line">prep_fun = tolower</td>
</tr>
<tr>
<td id="file-featurization-r-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
<td id="file-featurization-r-LC18" class="blob-code blob-code-inner js-file-line">tok_fun = word_tokenizer</td>
</tr>
<tr>
<td id="file-featurization-r-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
<td id="file-featurization-r-LC19" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-featurization-r-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
<td id="file-featurization-r-LC20" class="blob-code blob-code-inner js-file-line">it_train = itoken(df_train$text, preprocessor = prep_fun, tokenizer = tok_fun, ids = df_train$ID, progressbar = FALSE)</td>
</tr>
<tr>
<td id="file-featurization-r-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
<td id="file-featurization-r-LC21" class="blob-code blob-code-inner js-file-line">vocab = create_vocabulary(it_train,stopwords = stop_words)</td>
</tr>
<tr>
<td id="file-featurization-r-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
<td id="file-featurization-r-LC22" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-featurization-r-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
<td id="file-featurization-r-LC23" class="blob-code blob-code-inner js-file-line">#clean vocabulary - use only 5000 terms</td>
</tr>
<tr>
<td id="file-featurization-r-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
<td id="file-featurization-r-LC24" class="blob-code blob-code-inner js-file-line">pruned_vocab <- prune_vocabulary(vocab, max_number_of_terms=5000)</td>
</tr>
<tr>
<td id="file-featurization-r-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
<td id="file-featurization-r-LC25" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-featurization-r-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
<td id="file-featurization-r-LC26" class="blob-code blob-code-inner js-file-line">vectorizer = vocab_vectorizer(pruned_vocab)</td>
</tr>
<tr>
<td id="file-featurization-r-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
<td id="file-featurization-r-LC27" class="blob-code blob-code-inner js-file-line">dtm_train = create_dtm(it_train, vectorizer)</td>
</tr>
<tr>
<td id="file-featurization-r-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
<td id="file-featurization-r-LC28" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-featurization-r-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
<td id="file-featurization-r-LC29" class="blob-code blob-code-inner js-file-line">#create tf-idf for train data set</td>
</tr>
<tr>
<td id="file-featurization-r-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
<td id="file-featurization-r-LC30" class="blob-code blob-code-inner js-file-line">tfidf = TfIdf$new()</td>
</tr>
<tr>
<td id="file-featurization-r-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
<td id="file-featurization-r-LC31" class="blob-code blob-code-inner js-file-line">dtm_train_tfidf = fit_transform(dtm_train, tfidf)</td>
</tr>
<tr>
<td id="file-featurization-r-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
<td id="file-featurization-r-LC32" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-featurization-r-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
<td id="file-featurization-r-LC33" class="blob-code blob-code-inner js-file-line">#create test tf-idf - use vocabulary that is built on train</td>
</tr>
<tr>
<td id="file-featurization-r-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td>
<td id="file-featurization-r-LC34" class="blob-code blob-code-inner js-file-line">it_test = itoken(df_test$text, preprocessor = prep_fun, tokenizer = tok_fun, ids = df_test$ID, progressbar = FALSE)</td>
</tr>
<tr>
<td id="file-featurization-r-L35" class="blob-num js-line-number js-blob-rnum" data-line-number="35"></td>
<td id="file-featurization-r-LC35" class="blob-code blob-code-inner js-file-line">dtm_test_tfidf = create_dtm(it_test, vectorizer) %>% </td>
</tr>
<tr>
<td id="file-featurization-r-L36" class="blob-num js-line-number js-blob-rnum" data-line-number="36"></td>
<td id="file-featurization-r-LC36" class="blob-code blob-code-inner js-file-line"> transform(tfidf)</td>
</tr>
<tr>
<td id="file-featurization-r-L37" class="blob-num js-line-number js-blob-rnum" data-line-number="37"></td>
<td id="file-featurization-r-LC37" class="blob-code blob-code-inner js-file-line">#add label as the first column in matrices</td>
</tr>
<tr>
<td id="file-featurization-r-L38" class="blob-num js-line-number js-blob-rnum" data-line-number="38"></td>
<td id="file-featurization-r-LC38" class="blob-code blob-code-inner js-file-line">dtm_train_tfidf<- Matrix(cbind(label=df_train$label,dtm_train_tfidf),sparse = TRUE)</td>
</tr>
<tr>
<td id="file-featurization-r-L39" class="blob-num js-line-number js-blob-rnum" data-line-number="39"></td>
<td id="file-featurization-r-LC39" class="blob-code blob-code-inner js-file-line">dtm_test_tfidf<- Matrix(cbind(label=df_test$label,dtm_test_tfidf),sparse = TRUE)</td>
</tr>
<tr>
<td id="file-featurization-r-L40" class="blob-num js-line-number js-blob-rnum" data-line-number="40"></td>
<td id="file-featurization-r-LC40" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-featurization-r-L41" class="blob-num js-line-number js-blob-rnum" data-line-number="41"></td>
<td id="file-featurization-r-LC41" class="blob-code blob-code-inner js-file-line"># write output - tf-idf matrices</td>
</tr>
<tr>
<td id="file-featurization-r-L42" class="blob-num js-line-number js-blob-rnum" data-line-number="42"></td>
<td id="file-featurization-r-LC42" class="blob-code blob-code-inner js-file-line">writeMM(dtm_train_tfidf,args[3])</td>
</tr>
<tr>
<td id="file-featurization-r-L43" class="blob-num js-line-number js-blob-rnum" data-line-number="43"></td>
<td id="file-featurization-r-LC43" class="blob-code blob-code-inner js-file-line">writeMM(dtm_test_tfidf,args[4])</td>
</tr>
<tr>
<td id="file-featurization-r-L44" class="blob-num js-line-number js-blob-rnum" data-line-number="44"></td>
<td id="file-featurization-r-LC44" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-featurization-r-L45" class="blob-num js-line-number js-blob-rnum" data-line-number="45"></td>
<td id="file-featurization-r-LC45" class="blob-code blob-code-inner js-file-line">print("Two matrices were created - one for train and one for test data set")</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
</div>
<div class="gist-meta">
<a href="https://gist.github.com/Zoldin/9e79c047fd8ad7aa6596b0682aca83c6/raw/2787bc21fa8b2591ca09102f38f544eb5d6cf032/featurization.R" style="float:right" class="Link--inTextBlock">view raw</a>
<a href="https://gist.github.com/Zoldin/9e79c047fd8ad7aa6596b0682aca83c6#file-featurization-r" class="Link--inTextBlock">
featurization.R
</a>
</div>
</div>
</div><p></p>
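<p>The key idea in featurization.R is that the tf-idf weights are fitted on the training
set only and then reused for the test set. The following is a minimal sketch of that pattern,
written in plain Python so it runs without any R packages. The names <em>tokenize</em>,
<em>fit_idf</em> and <em>transform</em> are hypothetical, the toy sentences are made up, and
the weighting uses the textbook formula idf = log(N / df), which may differ from the
smoothing and normalization that text2vec applies by default. Stop-word removal and
vocabulary pruning are left out.</p>

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace: a simplified stand-in for
    # tolower + word_tokenizer in the R script.
    return text.lower().split()


def fit_idf(train_docs):
    # Learn inverse document frequencies from the training documents only.
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def transform(docs, idf):
    # Weight each known term by (term frequency) * idf.
    # Terms that were not seen while fitting are dropped.
    rows = []
    for doc in docs:
        tokens = [t for t in tokenize(doc) if t in idf]
        counts = Counter(tokens)
        total = len(tokens)
        rows.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return rows


# Toy data (assumption): two training posts and one test post.
train = ["how to merge data frames in R", "merge two lists in python"]
test = ["merge data in julia"]

idf = fit_idf(train)
train_tfidf = transform(train, idf)
test_tfidf = transform(test, idf)
```

<p>In this sketch the word "julia" never appears in the training posts, so it gets no column
in the test matrix, while "merge" appears in every training post and receives a weight of
zero. Reusing the training vocabulary keeps the train and test matrices aligned column by
column, which the model needs at prediction time.</p>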
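<p><strong>train_model.R</strong>: using the features created above, this script fits a logistic
regression with an L1 (LASSO) penalty, tuned by 5-fold cross-validation on AUC, and saves the
model to an RData file.</p>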
<p></p><div id="gist71114340" class="gist">
<div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container">
<div id="file-train_model-r" class="file my-2">
<div itemprop="text" class="Box-body p-0 blob-wrapper data type-r" style="overflow: auto" tabindex="0" role="region" aria-label="train_model.R content, created by Zoldin on 08:41PM on July 21, 2017.">
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="train_model.R">
<tbody><tr>
<td id="file-train_model-r-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
<td id="file-train_model-r-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-c"><span class="pl-c">#</span>!/usr/bin/Rscript</span></td>
</tr>
<tr>
<td id="file-train_model-r-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
<td id="file-train_model-r-LC2" class="blob-code blob-code-inner js-file-line">library(<span class="pl-smi">Matrix</span>)</td>
</tr>
<tr>
<td id="file-train_model-r-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
<td id="file-train_model-r-LC3" class="blob-code blob-code-inner js-file-line">library(<span class="pl-smi">glmnet</span>)</td>
</tr>
<tr>
<td id="file-train_model-r-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
<td id="file-train_model-r-LC4" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_model-r-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
<td id="file-train_model-r-LC5" class="blob-code blob-code-inner js-file-line"> <span class="pl-c"><span class="pl-c">#</span> three arguments need to be provided - train file (.txt, matrix), seed and output name for RData file</span></td>
</tr>
<tr>
<td id="file-train_model-r-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
<td id="file-train_model-r-LC6" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_model-r-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
<td id="file-train_model-r-LC7" class="blob-code blob-code-inner js-file-line"><span class="pl-v">args</span> <span class="pl-k">=</span> commandArgs(<span class="pl-v">trailingOnly</span><span class="pl-k">=</span><span class="pl-c1">TRUE</span>)</td>
</tr>
<tr>
<td id="file-train_model-r-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
<td id="file-train_model-r-LC8" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_model-r-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
<td id="file-train_model-r-LC9" class="blob-code blob-code-inner js-file-line"><span class="pl-k">if</span> (<span class="pl-k">!</span>length(<span class="pl-smi">args</span>)<span class="pl-k">==</span><span class="pl-c1">3</span>) {</td>
</tr>
<tr>
<td id="file-train_model-r-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
<td id="file-train_model-r-LC10" class="blob-code blob-code-inner js-file-line"> stop(<span class="pl-s"><span class="pl-pds">"</span>Three arguments must be supplied ( train file (.txt, matrix), seed and argument for RData model name).\n<span class="pl-pds">"</span></span>, <span class="pl-v">call.</span><span class="pl-k">=</span><span class="pl-c1">FALSE</span>)</td>
</tr>
<tr>
<td id="file-train_model-r-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
<td id="file-train_model-r-LC11" class="blob-code blob-code-inner js-file-line">} </td>
</tr>
<tr>
<td id="file-train_model-r-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
<td id="file-train_model-r-LC12" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_model-r-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
<td id="file-train_model-r-LC13" class="blob-code blob-code-inner js-file-line"><span class="pl-c"><span class="pl-c">#</span>read train data set </span></td>
</tr>
<tr>
<td id="file-train_model-r-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
<td id="file-train_model-r-LC14" class="blob-code blob-code-inner js-file-line"><span class="pl-v">trainMM</span> <span class="pl-k">=</span> readMM(<span class="pl-smi">args</span>[<span class="pl-c1">1</span>])</td>
</tr>
<tr>
<td id="file-train_model-r-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
<td id="file-train_model-r-LC15" class="blob-code blob-code-inner js-file-line">set.seed(as.numeric(<span class="pl-smi">args</span>[<span class="pl-c1">2</span>]))</td>
</tr>
<tr>
<td id="file-train_model-r-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
<td id="file-train_model-r-LC16" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_model-r-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
<td id="file-train_model-r-LC17" class="blob-code blob-code-inner js-file-line"><span class="pl-c"><span class="pl-c">#</span>use regular matrix, not sparse</span></td>
</tr>
<tr>
<td id="file-train_model-r-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
<td id="file-train_model-r-LC18" class="blob-code blob-code-inner js-file-line"><span class="pl-smi">trainMM_reg</span> <span class="pl-k"><-</span> as.matrix(<span class="pl-smi">trainMM</span>)</td>
</tr>
<tr>
<td id="file-train_model-r-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
<td id="file-train_model-r-LC19" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_model-r-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
<td id="file-train_model-r-LC20" class="blob-code blob-code-inner js-file-line"><span class="pl-v">t1</span> <span class="pl-k">=</span> Sys.time()</td>
</tr>
<tr>
<td id="file-train_model-r-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
<td id="file-train_model-r-LC21" class="blob-code blob-code-inner js-file-line">print(<span class="pl-s"><span class="pl-pds">"</span>Started to train the model... <span class="pl-pds">"</span></span>)</td>
</tr>
<tr>
<td id="file-train_model-r-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
<td id="file-train_model-r-LC22" class="blob-code blob-code-inner js-file-line"><span class="pl-v">glmnet_classifier</span> <span class="pl-k">=</span> cv.glmnet(<span class="pl-v">x</span> <span class="pl-k">=</span> <span class="pl-smi">trainMM_reg</span>[,<span class="pl-c1">2</span><span class="pl-k">:</span><span class="pl-c1">500</span>], <span class="pl-v">y</span> <span class="pl-k">=</span> <span class="pl-smi">trainMM_reg</span>[,<span class="pl-c1">1</span>], </td>
</tr>
<tr>
<td id="file-train_model-r-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
<td id="file-train_model-r-LC23" class="blob-code blob-code-inner js-file-line"> <span class="pl-v">family</span> <span class="pl-k">=</span> <span class="pl-s"><span class="pl-pds">'</span>binomial<span class="pl-pds">'</span></span>, </td>
</tr>
<tr>
<td id="file-train_model-r-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
<td id="file-train_model-r-LC24" class="blob-code blob-code-inner js-file-line"> <span class="pl-c"><span class="pl-c">#</span> L1 penalty</span></td>
</tr>
<tr>
<td id="file-train_model-r-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
<td id="file-train_model-r-LC25" class="blob-code blob-code-inner js-file-line"> <span class="pl-v">alpha</span> <span class="pl-k">=</span> <span class="pl-c1">1</span>,</td>
</tr>
<tr>
<td id="file-train_model-r-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
<td id="file-train_model-r-LC26" class="blob-code blob-code-inner js-file-line"> <span class="pl-c"><span class="pl-c">#</span> interested in the area under ROC curve</span></td>
</tr>
<tr>
<td id="file-train_model-r-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
<td id="file-train_model-r-LC27" class="blob-code blob-code-inner js-file-line"> <span class="pl-v">type.measure</span> <span class="pl-k">=</span> <span class="pl-s"><span class="pl-pds">"</span>auc<span class="pl-pds">"</span></span>,</td>
</tr>
<tr>
<td id="file-train_model-r-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
<td id="file-train_model-r-LC28" class="blob-code blob-code-inner js-file-line"> <span class="pl-c"><span class="pl-c">#</span> 5-fold cross-validation</span></td>
</tr>
<tr>
<td id="file-train_model-r-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
<td id="file-train_model-r-LC29" class="blob-code blob-code-inner js-file-line"> <span class="pl-v">nfolds</span> <span class="pl-k">=</span> <span class="pl-c1">5</span>,</td>
</tr>
<tr>
<td id="file-train_model-r-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
<td id="file-train_model-r-LC30" class="blob-code blob-code-inner js-file-line"> <span class="pl-c"><span class="pl-c">#</span> high value is less accurate, but has faster training</span></td>
</tr>
<tr>
<td id="file-train_model-r-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
<td id="file-train_model-r-LC31" class="blob-code blob-code-inner js-file-line"> <span class="pl-v">thresh</span> <span class="pl-k">=</span> <span class="pl-c1">1e-3</span>,</td>
</tr>
<tr>
<td id="file-train_model-r-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
<td id="file-train_model-r-LC32" class="blob-code blob-code-inner js-file-line"> <span class="pl-c"><span class="pl-c">#</span> again lower number of iterations for faster training</span></td>
</tr>
<tr>
<td id="file-train_model-r-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
<td id="file-train_model-r-LC33" class="blob-code blob-code-inner js-file-line"> <span class="pl-v">maxit</span> <span class="pl-k">=</span> <span class="pl-c1">1e3</span>)</td>
</tr>
<tr>
<td id="file-train_model-r-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td>
<td id="file-train_model-r-LC34" class="blob-code blob-code-inner js-file-line">print(<span class="pl-s"><span class="pl-pds">"</span>Model generated...<span class="pl-pds">"</span></span>)</td>
</tr>
<tr>
<td id="file-train_model-r-L35" class="blob-num js-line-number js-blob-rnum" data-line-number="35"></td>
<td id="file-train_model-r-LC35" class="blob-code blob-code-inner js-file-line">print(difftime(Sys.time(), <span class="pl-smi">t1</span>, <span class="pl-v">units</span> <span class="pl-k">=</span> <span class="pl-s"><span class="pl-pds">'</span>sec<span class="pl-pds">'</span></span>))</td>
</tr>
<tr>
<td id="file-train_model-r-L36" class="blob-num js-line-number js-blob-rnum" data-line-number="36"></td>
<td id="file-train_model-r-LC36" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_model-r-L37" class="blob-num js-line-number js-blob-rnum" data-line-number="37"></td>
<td id="file-train_model-r-LC37" class="blob-code blob-code-inner js-file-line"><span class="pl-v">preds</span> <span class="pl-k">=</span> predict(<span class="pl-smi">glmnet_classifier</span>, <span class="pl-smi">trainMM_reg</span>[,<span class="pl-c1">2</span><span class="pl-k">:</span><span class="pl-c1">500</span>], <span class="pl-v">type</span> <span class="pl-k">=</span> <span class="pl-s"><span class="pl-pds">'</span>response<span class="pl-pds">'</span></span>)[,<span class="pl-c1">1</span>]</td>
</tr>
<tr>
<td id="file-train_model-r-L38" class="blob-num js-line-number js-blob-rnum" data-line-number="38"></td>
<td id="file-train_model-r-LC38" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_model-r-L39" class="blob-num js-line-number js-blob-rnum" data-line-number="39"></td>
<td id="file-train_model-r-LC39" class="blob-code blob-code-inner js-file-line">print(<span class="pl-s"><span class="pl-pds">"</span>AUC for the train... <span class="pl-pds">"</span></span>)</td>
</tr>
<tr>
<td id="file-train_model-r-L40" class="blob-num js-line-number js-blob-rnum" data-line-number="40"></td>
<td id="file-train_model-r-LC40" class="blob-code blob-code-inner js-file-line"><span class="pl-e">glmnet</span><span class="pl-k">:::</span>auc(<span class="pl-smi">trainMM_reg</span>[,<span class="pl-c1">1</span>], <span class="pl-smi">preds</span>)</td>
</tr>
<tr>
<td id="file-train_model-r-L41" class="blob-num js-line-number js-blob-rnum" data-line-number="41"></td>
<td id="file-train_model-r-LC41" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-train_model-r-L42" class="blob-num js-line-number js-blob-rnum" data-line-number="42"></td>
<td id="file-train_model-r-LC42" class="blob-code blob-code-inner js-file-line">save(<span class="pl-smi">glmnet_classifier</span>,<span class="pl-v">file</span><span class="pl-k">=</span><span class="pl-smi">args</span>[<span class="pl-c1">3</span>])</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
</div>
<div class="gist-meta">
<a href="https://gist.github.com/Zoldin/1617b39f2acbde3cd486616ac442e7cf/raw/5f12bfcec59aeddd8428f9d9c571a243c2302ae6/train_model.R" style="float:right" class="Link--inTextBlock">view raw</a>
<a href="https://gist.github.com/Zoldin/1617b39f2acbde3cd486616ac442e7cf#file-train_model-r" class="Link--inTextBlock">
train_model.R
</a>
</div>
</div>
</div><p></p>
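<p>To illustrate what the L1 penalty and the AUC in train_model.R do, here is a small sketch
in plain Python. It is not how glmnet works internally: glmnet uses coordinate descent and
picks the penalty strength by cross-validation, whereas this sketch uses proximal gradient
descent with a fixed penalty. The names <em>fit_lasso_logistic</em>, <em>soft_threshold</em>,
<em>predict</em>, <em>auc</em> and the parameter <em>lam</em> (playing the role of glmnet's
lambda) are hypothetical, and the four-row data set is made up.</p>

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, amount):
    # Proximal step for the L1 penalty: shrink toward zero and
    # set small coefficients to exactly zero.
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def predict(X, w, b):
    # Predicted probability of the positive class for each row.
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) for row in X]


def fit_lasso_logistic(X, y, lam, lr=0.1, n_iter=500):
    # Minimize mean log-loss + lam * sum(|w|); the intercept b is not penalized.
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        errors = [pred - label for pred, label in zip(predict(X, w, b), y)]
        grad_b = sum(errors) / n
        grad_w = [sum(e * row[j] for e, row in zip(errors, X)) / n for j in range(p)]
        b -= lr * grad_b
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


def auc(labels, scores):
    # Share of (positive, negative) pairs ranked correctly; ties count half.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p_score in pos:
        for n_score in neg:
            if p_score > n_score:
                wins += 1.0
            elif p_score == n_score:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumption): the first feature separates the classes,
# the second one carries no signal.
X = [[1.0, 0.1], [2.0, -0.1], [-1.0, 0.1], [-2.0, -0.1]]
y = [1, 1, 0, 0]

w, b = fit_lasso_logistic(X, y, lam=0.05)
train_auc = auc(y, predict(X, w, b))
```

<p>On this toy data the penalty keeps the coefficient of the uninformative second feature at
exactly zero while the first coefficient stays positive. That sparsity is the reason a LASSO
penalty suits a wide tf-idf matrix, where most terms carry little signal.</p>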
<p><strong>evaluate.R</strong>: the trained model predicts the target on the test data set.
The final output is the AUC, which serves as the evaluation metric.</p>
<p></p><div id="gist71113477" class="gist">
<div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container">
<div id="file-evaluate-r" class="file my-2">
<div itemprop="text" class="Box-body p-0 blob-wrapper data type-r" style="overflow: auto" tabindex="0" role="region" aria-label="evaluate.r content, created by Zoldin on 08:37PM on July 21, 2017.">
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="evaluate.r">
<tbody><tr>
<td id="file-evaluate-r-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
<td id="file-evaluate-r-LC1" class="blob-code blob-code-inner js-file-line">#!/usr/bin/Rscript</td>
</tr>
<tr>
<td id="file-evaluate-r-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
<td id="file-evaluate-r-LC2" class="blob-code blob-code-inner js-file-line">library(Matrix)</td>
</tr>
<tr>
<td id="file-evaluate-r-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
<td id="file-evaluate-r-LC3" class="blob-code blob-code-inner js-file-line">library(glmnet)</td>
</tr>
<tr>
<td id="file-evaluate-r-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
<td id="file-evaluate-r-LC4" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-evaluate-r-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
<td id="file-evaluate-r-LC5" class="blob-code blob-code-inner js-file-line">args = commandArgs(trailingOnly=TRUE)</td>
</tr>
<tr>
<td id="file-evaluate-r-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
<td id="file-evaluate-r-LC6" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-evaluate-r-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
<td id="file-evaluate-r-LC7" class="blob-code blob-code-inner js-file-line">if (!length(args)==3) {</td>
</tr>
<tr>
<td id="file-evaluate-r-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
<td id="file-evaluate-r-LC8" class="blob-code blob-code-inner js-file-line"> stop("Three arguments must be supplied ( file name where model is stored (RDataname), test file (.txt, matrix) and file name for AUC output).\n", call.=FALSE)</td>
</tr>
<tr>
<td id="file-evaluate-r-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
<td id="file-evaluate-r-LC9" class="blob-code blob-code-inner js-file-line">} </td>
</tr>
<tr>
<td id="file-evaluate-r-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
<td id="file-evaluate-r-LC10" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-evaluate-r-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
<td id="file-evaluate-r-LC11" class="blob-code blob-code-inner js-file-line">#read test data set and model </td>
</tr>
<tr>
<td id="file-evaluate-r-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
<td id="file-evaluate-r-LC12" class="blob-code blob-code-inner js-file-line">load(args[1])</td>
</tr>
<tr>
<td id="file-evaluate-r-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
<td id="file-evaluate-r-LC13" class="blob-code blob-code-inner js-file-line">testMM = readMM(args[2])</td>
</tr>
<tr>
<td id="file-evaluate-r-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
<td id="file-evaluate-r-LC14" class="blob-code blob-code-inner js-file-line">testMM_reg <- as.matrix(testMM)</td>
</tr>
<tr>
<td id="file-evaluate-r-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
<td id="file-evaluate-r-LC15" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-evaluate-r-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
<td id="file-evaluate-r-LC16" class="blob-code blob-code-inner js-file-line">#predict test data</td>
</tr>
<tr>
<td id="file-evaluate-r-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
<td id="file-evaluate-r-LC17" class="blob-code blob-code-inner js-file-line">preds = predict(glmnet_classifier, testMM_reg[,2:500] , type = 'response')[, 1]</td>
</tr>
<tr>
<td id="file-evaluate-r-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
<td id="file-evaluate-r-LC18" class="blob-code blob-code-inner js-file-line"> glmnet:::auc(testMM_reg[,1], preds)</td>
</tr>
<tr>
<td id="file-evaluate-r-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
<td id="file-evaluate-r-LC19" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-evaluate-r-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
<td id="file-evaluate-r-LC20" class="blob-code blob-code-inner js-file-line">#write AUC into txt file</td>
</tr>
<tr>
<td id="file-evaluate-r-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
<td id="file-evaluate-r-LC21" class="blob-code blob-code-inner js-file-line">write.table(file=args[3],paste('AUC for the test file is : ',glmnet:::auc(testMM_reg[,1], preds)),row.names = FALSE,col.names = FALSE)</td>
</tr>
<tr>
<td id="file-evaluate-r-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
<td id="file-evaluate-r-LC22" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-evaluate-r-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
<td id="file-evaluate-r-LC23" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
</div>
<div class="gist-meta">
<a href="https://gist.github.com/Zoldin/bfc2d4ee449098a9ff64b99c3326e61d/raw/8044bf4a8bf9301113705332f6a26936bd89445b/evaluate.r" style="float:right" class="Link--inTextBlock">view raw</a>
<a href="https://gist.github.com/Zoldin/bfc2d4ee449098a9ff64b99c3326e61d#file-evaluate-r" class="Link--inTextBlock">
evaluate.r
</a>
hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
</div>
</div>
</div><p></p>
<p>First, we download the code above into a new folder by cloning the
repository:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">mkdir</span> R_DVC_GITHUB_CODE
</span><span class="token line"><span class="token input">$ </span><span class="token command">cd</span> R_DVC_GITHUB_CODE
</span>
<span class="token line"><span class="token input">$ </span><span class="token git">git clone</span> https://github.com/Zoldin/R_AND_DVC</span></code></pre></div>
<h2 id="dvc-installation-and-initialization" style="position:relative;">DVC installation and initialization<a href="#dvc-installation-and-initialization" aria-label="dvc installation and initialization permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>At first sight, it might seem that DVC would not work with R, because DVC is
written in Python and therefore requires Python packages and the pip package
manager. Nevertheless, the tool is language agnostic: it can be used with any
programming language, which makes it an excellent fit for working with R.</p>
<p>DVC installation:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip3</span> <span class="token function">install</span> dvc
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc init</span></span></code></pre></div>
<p>The code below executes five R scripts with <code>dvc run</code>. Each script is started
with some arguments: input and output file names and other parameters (seed,
splitting ratio, etc.). It is important to use <code>dvc run</code>, because this command is
what adds each R script to the pipeline (the DAG).</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc import</span> https://s3-us-west-2.amazonaws.com/dvc-public/data/tutorial/nlp/25K/Posts.xml.zip <span class="token punctuation">\</span>
data/
</span>
<span class="token comment"># Extract XML from the archive.</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token function">tar</span> zxf data/Posts.xml.tgz <span class="token parameter variable">-C</span> data/
</span>
<span class="token comment"># Prepare data.</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/parsingxml.R <span class="token punctuation">\</span>
data/Posts.xml <span class="token punctuation">\</span>
data/Posts.csv
</span>
<span class="token comment"># Split training and testing dataset. Two output files.</span>
<span class="token comment"># 0.33 is the test dataset splitting ratio.</span>
<span class="token comment"># 20170426 is a seed for randomization.</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/train_test_spliting.R <span class="token punctuation">\</span>
data/Posts.csv <span class="token number">0.33</span> <span class="token number">20170426</span> <span class="token punctuation">\</span>
data/train_post.csv <span class="token punctuation">\</span>
data/test_post.csv
</span>
<span class="token comment"># Extract features from text data.</span>
<span class="token comment"># Two CSV inputs and two matrix outputs (text files).</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/featurization.R <span class="token punctuation">\</span>
data/train_post.csv <span class="token punctuation">\</span>
data/test_post.csv <span class="token punctuation">\</span>
data/matrix_train.txt <span class="token punctuation">\</span>
data/matrix_test.txt
</span>
<span class="token comment"># Train ML model out of the training dataset.</span>
<span class="token comment"># 20170426 is another seed value.</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/train_model.R <span class="token punctuation">\</span>
data/matrix_train.txt <span class="token number">20170426</span> <span class="token punctuation">\</span>
data/glmnet.Rdata
</span>
<span class="token comment"># Evaluate the model by the testing dataset.</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/evaluate.R <span class="token punctuation">\</span>
data/glmnet.Rdata <span class="token punctuation">\</span>
data/matrix_test.txt <span class="token punctuation">\</span>
data/evaluation.txt
</span>
<span class="token comment"># The result.</span>
<span class="token line"><span class="token input">$ </span><span class="token command">cat</span> data/evaluation.txt</span></code></pre></div>
<h2 id="dependency-flow-graph-on-r-example" style="position:relative;">Dependency flow graph on R example<a href="#dependency-flow-graph-on-r-example" aria-label="dependency flow graph on r example permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>The dependency graph is shown in the picture below:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 256.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e9ba609b030acd01d27fcd1ff99a3f7f/bb9ec/dependency-graph.png" alt="Dependency graph" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Dependency
graph</em></p>
<p>DVC memorizes these dependencies and helps us reproduce the results at any
moment.</p>
<p>For example, let’s say that we change our training model to use a ridge
penalty instead of a lasso penalty (setting the alpha parameter to <code>0</code>). In that
case we modify the <code>train_model.R</code> job, and if we want to repeat model
development with this algorithm we don’t need to repeat all the steps above,
only the steps marked red in the picture below:</p>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 256.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/da29b8bd00ccba3578fdfe91cd7f34bc/bb9ec/marked-steps.png" alt="marked steps" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Based on the DAG, DVC knows that the changed <code>train_model.R</code> file only affects
the following files: <code>glmnet.Rdata</code> and <code>evaluation.txt</code>. To see our new
results we need to execute only the <code>train_model.R</code> and <code>evaluate.R</code> jobs.
Conveniently, we don’t have to keep track of which steps need to be repeated: the
<a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> command does that for us. Here is a code example:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">vi</span> train_model.R
</span><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span> <span class="token parameter variable">-am</span> <span class="token string">"Ridge penalty instead of lasso"</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> data/evaluation.txt
</span>
Reproducing run command for data item data/glmnet.Rdata. Args: Rscript code/train_model.R data/matrix_train.txt 20170426 data/glmnet.Rdata
Reproducing run command for data item data/evaluation.txt. Args: Rscript code/evaluate.R data/glmnet.Rdata data/matrix_test.txt data/evaluation.txt
<span class="token line"><span class="token input">$ </span><span class="token command">cat</span> data/evaluation.txt
</span>"AUC for the test file is : 0.947697381983095"</code></pre></div>
<p><a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> always re-executes the steps that are affected by the latest
developer changes. It knows what needs to be reproduced.</p>
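<p>The selection of steps to re-execute can be illustrated with a minimal Python
sketch. It is not DVC's actual implementation: the <code>STAGES</code> dictionary, the stage
names and the <code>stages_to_rerun</code> function are hypothetical, and the sketch assumes
the stages are listed in pipeline order. It only shows how a change propagates
through files that are outputs of one step and inputs of another.</p>

```python
# Hypothetical description of the R pipeline above: each stage lists the
# files it reads (deps) and the files it writes (outs). The stage names
# are made up for this sketch; the file names come from the commands above.
STAGES = {
    "parse": {
        "deps": ["code/parsingxml.R", "data/Posts.xml"],
        "outs": ["data/Posts.csv"],
    },
    "split": {
        "deps": ["code/train_test_spliting.R", "data/Posts.csv"],
        "outs": ["data/train_post.csv", "data/test_post.csv"],
    },
    "featurize": {
        "deps": ["code/featurization.R", "data/train_post.csv", "data/test_post.csv"],
        "outs": ["data/matrix_train.txt", "data/matrix_test.txt"],
    },
    "train": {
        "deps": ["code/train_model.R", "data/matrix_train.txt"],
        "outs": ["data/glmnet.Rdata"],
    },
    "evaluate": {
        "deps": ["code/evaluate.R", "data/glmnet.Rdata", "data/matrix_test.txt"],
        "outs": ["data/evaluation.txt"],
    },
}


def stages_to_rerun(stages, changed_file):
    """Return the stages affected by a changed file, in pipeline order."""
    dirty_files = {changed_file}
    affected = []
    # Assumption: stages are listed in pipeline (topological) order,
    # so a single pass is enough to propagate the change downstream.
    for name, stage in stages.items():
        if dirty_files & set(stage["deps"]):
            affected.append(name)
            # Everything this stage writes is now considered out of date.
            dirty_files.update(stage["outs"])
    return affected


print(stages_to_rerun(STAGES, "code/train_model.R"))  # ['train', 'evaluate']
```

<p>In this sketch, changing <code>code/train_model.R</code> marks only the training and
evaluation stages, which matches the two commands reproduced in the output
above.</p>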
<p>DVC can also work in a <em>“multi-user environment”</em>. Pipelines (dependency
graphs) are visible to other colleagues if we are working in a team and using
Git as our version control tool. Data files can be shared if we set up cloud
storage, and with <em>dvc sync</em> we specify which data is shared and made available
to other users. In that case other users can see the shared data and reproduce
results with that data and their own code changes.</p>
<h2 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>DVC improves and accelerates iterative development and helps keep track of ML
processes and file dependencies in a simple form. In the R example we saw how
DVC memorizes the dependency graph and, based on that graph, re-executes only
the jobs that are affected by the latest changes. It can also work in a
multi-user environment where dependency graphs, code and data can be shared among multiple
users. Because it is language agnostic, DVC allows us to work with multiple
programming languages within a single data science project.</p>https://dvc.org/blog/how-data-scientists-can-improve-their-productivityhttps://dvc.org/blog/how-data-scientists-can-improve-their-productivityMon, 15 May 2017 00:00:00 GMT<p>Data science and machine learning are iterative processes. It is never possible
to successfully complete a data science project in a single pass. A data
scientist constantly tries new ideas and changes steps of her pipeline:</p>
<ol>
<li>
<p>extract new features and accidentally find noise in the data;</p>
</li>
<li>
<p>clean up the noise, find one more promising feature;</p>
</li>
<li>
<p>extract the new feature;</p>
</li>
<li>
<p>rebuild and validate the model, realize that the learning algorithm
parameters are not perfect for the new feature set;</p>
</li>
<li>
<p>change machine learning algorithm parameters and retrain the model;</p>
</li>
<li>
<p>find the ineffective feature subset and remove it from the feature set;</p>
</li>
<li>
<p>try a few more new features;</p>
</li>
<li>
<p>try another ML algorithm. And then a data format change is required.</p>
</li>
</ol>
<p>This is only a small episode in a data scientist’s daily life and it is what
makes our job different from a regular engineering job.</p>
<p>Business context, ML algorithm knowledge and intuition all help you to find a
good model faster. But you never know for sure what ideas will bring you the
best value.</p>
<p>This is why iteration time is a critical parameter in the data science process.
The quicker you iterate, the more ideas you can check and the better the model
you can build.</p>
<blockquote>
<p>“A well-engineered pipeline gets data scientists iterating much faster, which
can be a big competitive edge” From
<a href="http://blog.untrod.com/2012/10/engineering-practices-in-data-science.html" target="_blank" rel="nofollow noopener noreferrer">Engineering Practices in Data Science</a>
By Chris Clark.</p>
</blockquote>
<h2 id="a-data-science-iteration-tool" style="position:relative;">A data science iteration tool<a href="#a-data-science-iteration-tool" aria-label="a data science iteration tool permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>To speed up the iterations in data science projects we have created an open
source tool <a href="http://dvc.org" target="_blank" rel="nofollow noopener noreferrer">data version control</a> or <a href="http://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC.org</a>.</p>
<p>DVC takes care of dependencies between the commands that you run, the generated
data files, and the code files, and allows you to easily reproduce any step of
your research with regard to file changes.</p>
<p>You can think about DVC as a Makefile for a data science project even though you
do not create a file explicitly. DVC tracks dependencies in your data science
projects when you run data processing or modeling code through a special
command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> python code/xml_to_tsv.py <span class="token punctuation">\</span>
data/Posts.xml data/Posts.tsv</span></code></pre></div>
<p><code>dvc run</code> works as a proxy for your commands. This allows DVC to track input and
output files, construct the dependency graph
(<a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph" target="_blank" rel="nofollow noopener noreferrer">DAG</a>), and store the
command and parameters for a future command reproduction.</p>
<p>The previous command will be automatically chained with the next command
because the file <code>data/Posts.tsv</code> is an output of the previous command and an
input for the next one:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token comment"># Split training and testing dataset. Two output files.</span>
<span class="token comment"># 0.33 is the test dataset splitting ratio.</span>
<span class="token comment"># 20170426 is a seed for randomization.</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> python code/split_train_test.py <span class="token punctuation">\</span>
data/Posts.tsv <span class="token number">0.33</span> <span class="token number">20170426</span> <span class="token punctuation">\</span>
data/Posts-train.tsv data/Posts-test.tsv</span></code></pre></div>
<p>DVC derives the dependencies automatically by looking at the list of
parameters (even if your code ignores the parameters) and noting the file
changes before and after running the command.</p>
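<p>The idea of noting file changes before and after a command can be sketched in
Python. This is an illustration only and not how DVC is implemented: the
<code>snapshot</code> and <code>run_and_track</code> functions are hypothetical names, the use of
SHA-256 content hashes is an assumption, and the sketch expects file arguments
to be written with the same path prefix as the directory being watched.</p>

```python
import hashlib
import os
import subprocess
import sys
import tempfile


def snapshot(root):
    """Map every file under root to a hash of its contents."""
    state = {}
    for folder, _, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                state[path] = hashlib.sha256(handle.read()).hexdigest()
    return state


def run_and_track(command, root):
    """Run command, then classify the files it touched or was given."""
    before = snapshot(root)
    subprocess.run(command, check=True)
    after = snapshot(root)
    # Files that are new, or whose content changed, are treated as outputs.
    outputs = sorted(p for p in after if before.get(p) != after[p])
    # Pre-existing, unchanged files named on the command line are inputs,
    # whether or not the command actually read them.
    inputs = sorted(a for a in command if a in before and a not in outputs)
    return {"cmd": command, "deps": inputs, "outs": outputs}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "Posts.xml")
        dst = os.path.join(root, "Posts.tsv")
        with open(src, "w") as handle:
            handle.write("<posts/>")
        # Stand-in for a real processing script: copy the input to the output.
        code = "import shutil, sys; shutil.copy(sys.argv[1], sys.argv[2])"
        record = run_and_track([sys.executable, "-c", code, src, dst], root)
        print("deps:", record["deps"])
        print("outs:", record["outs"])
```

<p>Because the inputs are taken from the argument list rather than from what the
command actually reads, a file passed as a parameter counts as a dependency even
when the code ignores it, as the paragraph above describes.</p>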
<p>If you change one of your dependencies (data or code) then all the affected
steps of the pipeline will be reproduced:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token comment"># Change the data preparation code.</span>
<span class="token line"><span class="token input">$ </span><span class="token command">vi</span> code/xml_to_tsv.py
</span>
<span class="token comment"># Reproduce.</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> data/Posts-train.tsv
</span>Reproducing run command for data item data/Posts.tsv.
Reproducing run command for data item data/Posts-train.tsv.</code></pre></div>
<p>This is a powerful way of quickly iterating through your pipeline.</p>
<p>The pipeline might have many steps and various forms of acyclic dependencies
between them. Below is an example of a canonical machine learning pipeline (more
details in <a href="https://dvc.org/doc/tutorials" target="_blank" rel="nofollow noopener noreferrer">the DVC tutorials</a>):</p>
<p></p><div id="gist47206784" class="gist">
<div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container">
<div id="file-dvc_pipeline-sh" class="file my-2">
<div itemprop="text" class="Box-body p-0 blob-wrapper data type-shell" style="overflow: auto" tabindex="0" role="region" aria-label="dvc_pipeline.sh content, created by dmpetrov on 07:11AM on April 30, 2017.">
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="dvc_pipeline.sh">
<tbody><tr>
<td id="file-dvc_pipeline-sh-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
<td id="file-dvc_pipeline-sh-LC1" class="blob-code blob-code-inner js-file-line"># Install DVC</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
<td id="file-dvc_pipeline-sh-LC2" class="blob-code blob-code-inner js-file-line">$ pip install dvc</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
<td id="file-dvc_pipeline-sh-LC3" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
<td id="file-dvc_pipeline-sh-LC4" class="blob-code blob-code-inner js-file-line"># Initialize DVC repository</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
<td id="file-dvc_pipeline-sh-LC5" class="blob-code blob-code-inner js-file-line">$ dvc init</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
<td id="file-dvc_pipeline-sh-LC6" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
<td id="file-dvc_pipeline-sh-LC7" class="blob-code blob-code-inner js-file-line"># Download a file and put to data/ directory.</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
<td id="file-dvc_pipeline-sh-LC8" class="blob-code blob-code-inner js-file-line">$ dvc import https://s3-us-west-2.amazonaws.com/dvc-share/so/25K/Posts.xml.tgz data/</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
<td id="file-dvc_pipeline-sh-LC9" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
<td id="file-dvc_pipeline-sh-LC10" class="blob-code blob-code-inner js-file-line"># Extract XML from the archive.</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
<td id="file-dvc_pipeline-sh-LC11" class="blob-code blob-code-inner js-file-line">$ dvc run tar zxf data/Posts.xml.tgz -C data/</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
<td id="file-dvc_pipeline-sh-LC12" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
<td id="file-dvc_pipeline-sh-LC13" class="blob-code blob-code-inner js-file-line"># Prepare data.</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
<td id="file-dvc_pipeline-sh-LC14" class="blob-code blob-code-inner js-file-line">$ dvc run python code/xml_to_tsv.py data/Posts.xml data/Posts.tsv python</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
<td id="file-dvc_pipeline-sh-LC15" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
<td id="file-dvc_pipeline-sh-LC16" class="blob-code blob-code-inner js-file-line"># Split training and testing dataset. Two output files.</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
<td id="file-dvc_pipeline-sh-LC17" class="blob-code blob-code-inner js-file-line"># 0.33 is the test dataset splitting ratio. 20170426 is a seed for randomization.</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
<td id="file-dvc_pipeline-sh-LC18" class="blob-code blob-code-inner js-file-line">$ dvc run python code/split_train_test.py data/Posts.tsv 0.33 20170426 data/Posts-train.tsv data/Posts-test.tsv</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
<td id="file-dvc_pipeline-sh-LC19" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
<td id="file-dvc_pipeline-sh-LC20" class="blob-code blob-code-inner js-file-line"># Extract features from text data. Two TSV inputs and two pickle matrixes outputs.</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
<td id="file-dvc_pipeline-sh-LC21" class="blob-code blob-code-inner js-file-line">$ dvc run python code/featurization.py data/Posts-train.tsv data/Posts-test.tsv data/matrix-train.p data/matrix-test.p</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
<td id="file-dvc_pipeline-sh-LC22" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
<td id="file-dvc_pipeline-sh-LC23" class="blob-code blob-code-inner js-file-line"># Train ML model out of the training dataset. 20170426 is another seed value.</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
<td id="file-dvc_pipeline-sh-LC24" class="blob-code blob-code-inner js-file-line">$ dvc run python code/train_model.py data/matrix-train.p 20170426 data/model.p</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
<td id="file-dvc_pipeline-sh-LC25" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
<td id="file-dvc_pipeline-sh-LC26" class="blob-code blob-code-inner js-file-line"># Evaluate the model by the testing dataset.</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
<td id="file-dvc_pipeline-sh-LC27" class="blob-code blob-code-inner js-file-line">$ dvc run python code/evaluate.py data/model.p data/matrix-test.p data/evaluation.txt</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
<td id="file-dvc_pipeline-sh-LC28" class="blob-code blob-code-inner js-file-line">
</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
<td id="file-dvc_pipeline-sh-LC29" class="blob-code blob-code-inner js-file-line"># The result.</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
<td id="file-dvc_pipeline-sh-LC30" class="blob-code blob-code-inner js-file-line">$ cat data/evaluation.txt</td>
</tr>
<tr>
<td id="file-dvc_pipeline-sh-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
<td id="file-dvc_pipeline-sh-LC31" class="blob-code blob-code-inner js-file-line">AUC: 0.596182</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
</div>
<div class="gist-meta">
<a href="https://gist.github.com/dmpetrov/7704a5156bdc32c7379580a61e2fe3b6/raw/166cf09a233861902f1765e9179c1dce556fdcf5/dvc_pipeline.sh" style="float:right" class="Link--inTextBlock">view raw</a>
<a href="https://gist.github.com/dmpetrov/7704a5156bdc32c7379580a61e2fe3b6#file-dvc_pipeline-sh" class="Link--inTextBlock">
dvc_pipeline.sh
</a>
hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
</div>
</div>
</div><p></p>
<h2 id="why-are-regular-pipeline-tools-not-enough" style="position:relative;">Why are regular pipeline tools not enough?<a href="#why-are-regular-pipeline-tools-not-enough" aria-label="why are regular pipeline tools not enough permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<blockquote>
<p>“Workflows are expected to be mostly static or slowly changing.” (See
<a href="https://airflow.incubator.apache.org/" target="_blank" rel="nofollow noopener noreferrer">Airflow</a>.)</p>
</blockquote>
<p>Regular pipeline tools like <a href="http://airflow.incubator.apache.org" target="_blank" rel="nofollow noopener noreferrer">Airflow</a> and
<a href="https://github.com/spotify/luigi" target="_blank" rel="nofollow noopener noreferrer">Luigi</a> are good at representing static,
fault-tolerant workflows. Much of their functionality exists for
monitoring, optimization, and fault tolerance. These are important,
business-critical problems, but they matter little in a data
scientist’s day-to-day work of exploring models.</p>
<p>Data scientists need a lightweight, dynamic workflow management system. In
contrast to traditional Airflow-like systems, DVC reflects the process of
researching and searching for a great model (and pipeline), not optimizing and
monitoring an existing one. This is why DVC is a good fit for iterative machine
learning processes. Once a good model has been found with DVC, the result can
be incorporated into a data engineering pipeline (Luigi or Airflow).</p>
<h2 id="pipelines-and-data-sharing" style="position:relative;">Pipelines and data sharing<a href="#pipelines-and-data-sharing" aria-label="pipelines and data sharing permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>In addition to pipeline description, data reproduction, and its dynamic nature, DVC
has one more important feature: it was designed in accordance with
software engineering best practices. DVC is built on Git. It keeps the code and the
DAG in the Git repository, which allows you to share your research results. The
actual file content, however, is moved outside the Git repository (into the <code>.cache</code>
directory, which DVC adds to <code>.gitignore</code>), since Git is not designed to
accommodate large data files.</p>
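<p>To make the cache idea concrete, here is a minimal Python sketch of how a data file could be kept out of Git while still appearing in the working tree. It is an illustration under stated assumptions, not DVC's actual implementation: the helper name <code>move_to_cache</code>, the checksum-based file name, and the use of a symlink are hypothetical choices for this example, and it assumes a filesystem that allows symlinks.</p>

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def move_to_cache(repo: Path, data_file: Path) -> Path:
    """Move data_file into repo/.cache and leave a symlink in its place.

    Hypothetical helper for illustration only; this is not DVC's real code.
    """
    cache_dir = repo / ".cache"
    cache_dir.mkdir(exist_ok=True)

    # Name the cached copy after a checksum of its content (an assumption).
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cached = cache_dir / f"{data_file.name}_{digest}"

    # Move the real content out of the working tree.
    shutil.move(str(data_file), str(cached))
    # Use a relative link so the repository directory can be relocated.
    data_file.symlink_to(os.path.relpath(cached, data_file.parent))

    # Make sure Git ignores the cache directory.
    gitignore = repo / ".gitignore"
    lines = gitignore.read_text().splitlines() if gitignore.exists() else []
    if ".cache" not in lines:
        gitignore.write_text("\n".join(lines + [".cache"]) + "\n")
    return cached


if __name__ == "__main__":
    # Throwaway directory standing in for a Git repository.
    repo = Path(tempfile.mkdtemp())
    (repo / "data").mkdir()
    sample = repo / "data" / "train.tsv"
    sample.write_text("id\tlabel\n1\t0\n")
    print(move_to_cache(repo, sample).name)
```

<p>After the call, Git sees only a small link and the <code>.gitignore</code> entry, while the large content lives under <code>.cache</code>.</p>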
<p>Data scientists can share data files with each other through cloud storage
using a single command:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token comment"># Data scientist 1 syncs data to the cloud.</span>
<span class="token line"><span class="token input">$ </span><span class="token command">dvc</span> <span class="token function">sync</span> data/</span></code></pre></div>
<p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 307px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/6890171452971f3e3cd847014a526e03/7fc5b/git-server-or-github.jpg" alt="git server or github" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p>
<p>Currently, AWS S3 and GCP storage are supported by DVC.</p>
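<p>Conceptually, syncing means copying cached files that the shared storage does not have yet. The Python sketch below illustrates that idea only. It is not how DVC talks to S3 or GCP: the function name <code>sync_cache</code> is hypothetical, and a local directory stands in for the cloud bucket so that the example runs anywhere.</p>

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def sync_cache(cache_dir: Path, remote_dir: Path) -> List[str]:
    """Copy cache files that remote_dir lacks and return the names copied.

    Hypothetical one-way sync; remote_dir is a local stand-in for a bucket.
    """
    remote_dir.mkdir(parents=True, exist_ok=True)
    uploaded = []
    for item in sorted(cache_dir.iterdir()):
        target = remote_dir / item.name
        # Skip anything the "remote" already holds under the same name.
        if item.is_file() and not target.exists():
            shutil.copy2(item, target)
            uploaded.append(item.name)
    return uploaded


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    cache = root / ".cache"
    cache.mkdir()
    (cache / "train.tsv_abc").write_text("id\tlabel\n")
    # The first call copies the file; a second call finds nothing new.
    print(sync_cache(cache, root / "bucket"))
    print(sync_cache(cache, root / "bucket"))
```

<p>Because only missing files are copied, repeated syncs stay cheap even when the cache is large.</p>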
<h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg>
</a></h2>
<p>Data scientists become more productive when they can iterate faster, and
DVC is designed to shorten that iteration loop.</p>
<p>We are very interested in your opinion and feedback. Please post your comments
here or contact us on Twitter at <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">FullStackML</a>.</p>
<p>If you find this tool useful, <strong>please “star” the
<a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">DVC GitHub repository</a></strong>.</p>