Bringing Automation To Data Labeling For Machine Learning With Watchful
Data Engineering Podcast - A podcast by Tobias Macey - Sundays
Summary

Data engineers have typically left the process of data labeling to data scientists or other roles because of its nature as a manual and process-heavy undertaking, focusing instead on building automation and repeatable systems. Watchful is a platform to make labeling a repeatable and scalable process that relies on codifying domain expertise. In this episode founder Shayan Mohanty explains how he and his team are bringing software best practices and automation to the world of machine learning data preparation and how it allows data engineers to be involved in the process.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!

Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack, observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels.
All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!

The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.

Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP.
Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.

Your host is Tobias Macey and today I’m interviewing Shayan Mohanty about Watchful, a data-centric platform for labeling your machine learning inputs.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Watchful is and the story behind it?
What are your core goals at Watchful?
What problem are you solving and who are the people most impacted by that problem?
What is the role of the data engineer in the process of getting data labeled for machine learning projects?
Data labeling is a large and competitive market. How do you characterize the different approaches offered by the various platforms and services?
What are the main points of friction involved in getting data labeled?
How do the types of data and its applications factor into how those challenges manifest?
What does Watchful provide that allows it to address those obstacles?
Can you describe how Watchful is implemented?
What are some of the initial ideas/assumptions that you have had to re-evaluate?
What are some of the ways that you have had to adjust the design of your user experience flows since you first started?
What is the workflow for teams who are adopting Watchful?
What are the types of collaboration that need to happen in the data labeling process?
What are some of the elements of shared vocabulary that different stakeholders in the process need to establish to be successful?
What are the most interesting, innovative, or unexpected ways that you have seen Watchful used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Watchful?
When is Watchful the wrong choice?
What do you have planned for the future of Watchful?
Contact Info

LinkedIn
@shayanjm on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Watchful
Entity Resolution
Supervised Machine Learning
BERT
CLIP
LabelBox
Label Studio
Snorkel AI
Machine Learning Podcast Episode
RegEx == Regular Expression
REPL == Read Evaluate Print Loop
IDE == Integrated Development Environment
Turing Completeness
Clojure
Rust
Named Entity Recognition
The Halting Problem
NP Hard
Lidar
Shayan: Arguments Against Hand Labeling

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast