Labeled data
{{Short description|Group of samples that have been tagged with one or more labels}} {{Multiple issues|{{more citations needed | date = May 2017}} {{Tone|date=April 2024}}}} {{Machine learning bar}}'''Labeled data''' is a group of [[Sample (statistics)|samples]] that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of it with informative tags called '''judgments'''. For example, a data label might indicate whether a photo contains a horse or a cow, which words were uttered in an audio recording, what type of action is being performed in a video, what the topic of a news article is, what the overall sentiment of a tweet is, or whether a dot in an X-ray is a tumor.
Labels can be obtained by having humans make judgments about a given piece of unlabeled data.{{Cite web |title=What is Data Labeling? - Data Labeling Explained - AWS |url=https://aws.amazon.com/what-is/data-labeling/ |access-date=2024-07-16 |website=Amazon Web Services, Inc. |language=en-US}} Labeled data is significantly more expensive to obtain than the raw unlabeled data.
The quality of labeled data directly influences the performance of [[Supervised learning|supervised machine learning]] models in operation, as these models learn from the provided labels.{{Citation |last1=Fredriksson |first1=Teodor |title=Data Labeling: An Empirical Investigation into Industrial Challenges and Mitigation Strategies |date=2020 |work=Product-Focused Software Process Improvement |volume=12562 |pages=202–216 |editor-last=Morisio |editor-first=Maurizio |url=https://link.springer.com/10.1007/978-3-030-64148-1_13 |access-date=2024-07-13 |place=Cham |publisher=Springer International Publishing |language=en |doi=10.1007/978-3-030-64148-1_13 |isbn=978-3-030-64147-4 |last2=Mattos |first2=David Issa |last3=Bosch |first3=Jan |last4=Olsson |first4=Helena Holmström |editor2-last=Torchiano |editor2-first=Marco |editor3-last=Jedlitschka |editor3-first=Andreas|url-access=subscription }}
==Crowdsourced labeled data== In 2006, [[Fei-Fei Li]], the co-director of the [[Stanford University|Stanford]] Human-Centered AI Institute, initiated research to improve the [[artificial intelligence]] models and algorithms for image recognition by significantly enlarging the [[training data]]. The researchers downloaded millions of images from the [[World Wide Web]], and a team of undergraduates began applying object labels to each image. In 2007, Li outsourced the data labeling work to [[Amazon Mechanical Turk]], an [[online marketplace]] for digital [[piece work]]. The 3.2 million images that were labeled by more than 49,000 workers formed the basis for [[ImageNet]], one of the largest hand-labeled databases for [[outline of object recognition|object recognition]].{{Cite book|author1=Mary L. Gray |author2=Siddharth Suri |title=Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass|publisher=Houghton Mifflin Harcourt|year=2019|isbn=978-1-328-56628-7|page=7}}
==Automated data labeling== After obtaining a labeled dataset, [[machine learning]] models can be trained on the data so that, when new unlabeled data is presented to the model, a likely label can be predicted for each piece of it.Johnson, Leif. [https://stackoverflow.com/a/19172720 "What is the difference between labeled and unlabeled data?"], ''[[Stack Overflow]]'', 4 October 2013. Retrieved on 13 May 2017. {{CC-notice|cc=bysa3|url=https://stackoverflow.com/a/19172720|author=[https://stackoverflow.com/users/2014584/lmjohns3 lmjohns3]}}
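The idea above can be sketched with a toy classifier: a model is fit on labeled samples, then used to predict labels for unlabeled samples. The nearest-centroid classifier, the 2-D points, and the "cow"/"horse" labels below are illustrative assumptions, not taken from any specific system.

```python
# Minimal sketch: train on labeled data, then predict labels for
# unlabeled data, using a simple nearest-centroid classifier.
from collections import defaultdict

def train_centroids(labeled):
    """Compute one centroid per label from (point, label) pairs."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), label in labeled:
        s = sums[label]
        s[0] += x; s[1] += y; s[2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def predict(centroids, point):
    """Assign the label whose centroid is closest to the point."""
    px, py = point
    return min(centroids,
               key=lambda lab: (px - centroids[lab][0]) ** 2 +
                               (py - centroids[lab][1]) ** 2)

# Labeled data: each sample has been tagged with a label.
labeled = [((0.0, 0.0), "cow"), ((1.0, 0.5), "cow"),
           ((5.0, 5.0), "horse"), ((6.0, 4.5), "horse")]
centroids = train_centroids(labeled)

# New, unlabeled samples receive predicted labels.
print(predict(centroids, (0.3, 0.2)))   # -> cow
print(predict(centroids, (5.5, 5.0)))   # -> horse
```

Real systems replace the centroid rule with a trained model, but the workflow is the same: labeled data in, predicted labels out.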
==Challenges==
=== Data-driven bias === [[Algorithmic decision making|Algorithmic decision-making]] is subject to programmer-driven bias as well as data-driven bias. Training data that relies on biased labels will result in prejudices and omissions in a [[predictive model]], even when the machine learning algorithm itself is sound. The labeled data used to train a specific machine learning algorithm needs to be a statistically [[representative sample]] to not bias the results.{{Cite book |author=Xianhong Hu |title=Steering AI and advanced ICTs for knowledge societies: a Rights, Openness, Access, and Multi-stakeholder Perspective |author2=Bhanu Neupane |author3=Lucia Flores Echaiz |author4=Prateek Sibal |author5=Macarena Rivera Lam |publisher=UNESCO Publishing |year=2019 |isbn=978-92-3-100363-9 |page=64}} For example, in [[facial recognition system]]s, underrepresented groups are often misclassified when the labeled training data has not been representative of the population. In 2018, a study by [[Joy Buolamwini]] and [[Timnit Gebru]] demonstrated that two facial-analysis datasets used to train facial recognition algorithms, IJB-A and Adience, are composed of 79.6% and 86.2% lighter-skinned humans respectively.{{Cite book |author=Xianhong Hu |title=Steering AI and advanced ICTs for knowledge societies: a Rights, Openness, Access, and Multi-stakeholder Perspective |author2=Bhanu Neupane |author3=Lucia Flores Echaiz |author4=Prateek Sibal |author5=Macarena Rivera Lam |publisher=UNESCO Publishing |year=2019 |isbn=978-92-3-100363-9 |page=66}}
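One simple way to surface this kind of skew is to compare group proportions in the labeled training set against the target population. The group names and proportions below are hypothetical, chosen only to echo the imbalance described above.

```python
# Sketch: flag an unrepresentative labeled training set by comparing
# its group proportions with hypothetical population proportions.
from collections import Counter

def proportions(samples):
    """Fraction of samples belonging to each group."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}

# Hypothetical group membership of each labeled training sample.
training_groups = ["lighter"] * 80 + ["darker"] * 20
population = {"lighter": 0.5, "darker": 0.5}

train_props = proportions(training_groups)
skew = {g: train_props.get(g, 0.0) - p for g, p in population.items()}
print(skew)  # a large gap flags an unrepresentative training set
```

A gap of 0.3 for each group, as here, would indicate that models trained on these labels may perform worse on the underrepresented group.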
=== Human error and inconsistency === Human annotators are prone to errors and biases when labeling data, which can produce inconsistent labels and degrade the quality of the data set. Such inconsistency limits a [[machine learning]] model's ability to generalize well.{{Cite journal |last1=Geiger |first1=R. Stuart |last2=Cope |first2=Dominique |last3=Ip |first3=Jamie |last4=Lotosh |first4=Marsha |last5=Shah |first5=Aayush |last6=Weng |first6=Jenny |last7=Tang |first7=Rebekah |date=2021-11-05 |title="Garbage in, garbage out" revisited: What do machine learning application papers report about human-labeled training data? |url=https://direct.mit.edu/qss/article/2/3/795/102771/Garbage-in-garbage-out-revisited-What-do-machine |journal=Quantitative Science Studies |language=en |volume=2 |issue=3 |pages=795–827 |doi=10.1162/qss_a_00144 |doi-access=free|issn=2641-3337|arxiv=2107.02278 }}
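Annotator inconsistency is commonly quantified with inter-annotator agreement statistics such as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The two annotators and their "cat"/"dog" labels below are hypothetical.

```python
# Sketch: measuring label inconsistency between two hypothetical
# annotators with Cohen's kappa (chance-corrected agreement).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement under independent annotators.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["cat", "cat", "dog", "dog", "cat", "dog"]
b = ["cat", "dog", "dog", "dog", "cat", "cat"]
print(round(cohens_kappa(a, b), 3))  # -> 0.333
```

A kappa near 1 indicates consistent labels; values near 0 suggest the annotators agree little more than chance, a warning sign for the resulting data set.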
=== Domain expertise === Certain fields, such as legal document analysis or [[medical imaging]], require annotators with specialized domain knowledge. Without such expertise, the annotations or labeled data may be inaccurate, negatively impacting the machine learning model's performance in real-world scenarios.{{Cite journal |last1=Alzubaidi |first1=Laith |last2=Bai |first2=Jinshuai |last3=Al-Sabaawi |first3=Aiman |last4=Santamaría |first4=Jose |last5=Albahri |first5=A. S. |last6=Al-dabbagh |first6=Bashar Sami Nayyef |last7=Fadhel |first7=Mohammed A. |last8=Manoufali |first8=Mohamed |last9=Zhang |first9=Jinglan |last10=Al-Timemy |first10=Ali H. |last11=Duan |first11=Ye |last12=Abdullah |first12=Amjed |last13=Farhan |first13=Laith |last14=Lu |first14=Yi |last15=Gupta |first15=Ashish |date=2023-04-14 |title=A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications |journal=Journal of Big Data |volume=10 |issue=1 |pages=46 |doi=10.1186/s40537-023-00727-2 |doi-access=free |issn=2196-1115}}
==See also==
- [[Data annotation]]
- [[Humans in the Loop (film)|''Humans in the Loop'' (film)]]
==References== {{Reflist}}
[[Category:Machine learning]]