Labeled data
{{Short description|Group of samples that have been tagged with one or more labels}} {{Multiple issues|{{more citations needed | date = May 2017}} {{Tone|date=April 2024}}}} {{Machine learning bar}}'''Labeled data''' is a group of [[Sample (statistics)|samples]] that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of it with informative tags called '''judgments'''. For example, a data label might indicate whether a photo contains a horse or a cow, which words were uttered in an audio recording, what type of action is being performed in a video, what the topic of a news article is, what the overall sentiment of a tweet is, or whether a dot in an X-ray is a tumor.
Labels can be obtained by having humans make judgments about a given piece of unlabeled data.{{Cite web |title=What is Data Labeling? - Data Labeling Explained - AWS |url=https://aws.amazon.com/what-is/data-labeling/ |access-date=2024-07-16 |website=Amazon Web Services, Inc. |language=en-US}} Labeled data is significantly more expensive to obtain than the raw unlabeled data.
The quality of labeled data directly influences the performance of [[Supervised learning|supervised machine learning]] models in operation, as these models learn from the provided labels.{{Citation |last1=Fredriksson |first1=Teodor |title=Data Labeling: An Empirical Investigation into Industrial Challenges and Mitigation Strategies |date=2020 |work=Product-Focused Software Process Improvement |volume=12562 |pages=202–216 |editor-last=Morisio |editor-first=Maurizio |url=https://link.springer.com/10.1007/978-3-030-64148-1_13 |access-date=2024-07-13 |place=Cham |publisher=Springer International Publishing |language=en |doi=10.1007/978-3-030-64148-1_13 |isbn=978-3-030-64147-4 |last2=Mattos |first2=David Issa |last3=Bosch |first3=Jan |last4=Olsson |first4=Helena Holmström |editor2-last=Torchiano |editor2-first=Marco |editor3-last=Jedlitschka |editor3-first=Andreas|url-access=subscription }}
==Crowdsourced labeled data== In 2006, [[Fei-Fei Li]], the co-director of the [[Stanford University|Stanford]] Human-Centered AI Institute, initiated research to improve the [[artificial intelligence]] models and algorithms for image recognition by significantly enlarging the [[training data]]. The researchers downloaded millions of images from the [[World Wide Web]], and a team of undergraduates began applying object labels to each image. In 2007, Li outsourced the data labeling work to [[Amazon Mechanical Turk]], an [[online marketplace]] for digital [[piece work]]. The 3.2 million images that were labeled by more than 49,000 workers formed the basis for [[ImageNet]], one of the largest hand-labeled databases for [[outline of object recognition|object recognition]].{{Cite book|author1=Mary L. Gray |author2=Siddharth Suri |title=Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass|publisher=Houghton Mifflin Harcourt|year=2019|isbn=978-1-328-56628-7|page=7}}
==Automated data labeling== After obtaining a labeled dataset, [[machine learning]] models can be trained on the data so that, when new unlabeled data is presented to the model, a likely label can be predicted for each piece of it.Johnson, Leif. [https://stackoverflow.com/a/19172720 "What is the difference between labeled and unlabeled data?"], ''[[Stack Overflow]]'', 4 October 2013. Retrieved on 13 May 2017. {{CC-notice|cc=bysa3|url=https://stackoverflow.com/a/19172720|author=[https://stackoverflow.com/users/2014584/lmjohns3 lmjohns3]}}
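The idea above can be sketched with a toy classifier: a model is fit on labeled samples, then used to predict labels for unlabeled samples. The nearest-centroid classifier, the 2-D points, and the "cow"/"horse" labels below are illustrative assumptions, not taken from any specific system.

```python
# Minimal sketch: train on labeled data, then predict labels for
# unlabeled data, using a simple nearest-centroid classifier.
from collections import defaultdict

def train_centroids(labeled):
    """Compute one centroid per label from (point, label) pairs."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), label in labeled:
        s = sums[label]
        s[0] += x; s[1] += y; s[2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def predict(centroids, point):
    """Assign the label whose centroid is closest to the point."""
    px, py = point
    return min(centroids,
               key=lambda lab: (px - centroids[lab][0]) ** 2 +
                               (py - centroids[lab][1]) ** 2)

# Labeled data: each sample has been tagged with a label.
labeled = [((0.0, 0.0), "cow"), ((1.0, 0.5), "cow"),
           ((5.0, 5.0), "horse"), ((6.0, 4.5), "horse")]
centroids = train_centroids(labeled)

# New, unlabeled samples receive predicted labels.
print(predict(centroids, (0.3, 0.2)))   # -> cow
print(predict(centroids, (5.5, 5.0)))   # -> horse
```

Real systems replace the centroid rule with a trained model, but the workflow is the same: labeled data in, predicted labels out.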
==Challenges==
=== Data-driven bias === [[Algorithmic decision making|Algorithmic decision-making]] is subject to programmer-driven bias as well as data-driven bias. Training data that relies on biased labels will result in prejudices and omissions in a [[predictive model]], even when the machine learning algorithm itself is sound. The labeled data used to train a specific machine learning algorithm needs to be a statistically [[representative sample]] to not bias the results.{{Cite book |author=Xianhong Hu |title=Steering AI and advanced ICTs for knowledge societies: a Rights, Openness, Access, and Multi-stakeholder Perspective |author2=Bhanu Neupane |author3=Lucia Flores Echaiz |author4=Prateek Sibal |author5=Macarena Rivera Lam |publisher=UNESCO Publishing |year=2019 |isbn=978-92-3-100363-9 |page=64}} For example, in [[facial recognition system]]s, underrepresented groups are often misclassified when the labeled training data has not been representative of the population. In 2018, a study by [[Joy Buolamwini]] and [[Timnit Gebru]] demonstrated that two facial-analysis datasets used to train facial recognition algorithms, IJB-A and Adience, are composed of 79.6% and 86.2% lighter-skinned humans respectively.{{Cite book |author=Xianhong Hu |title=Steering AI and advanced ICTs for knowledge societies: a Rights, Openness, Access, and Multi-stakeholder Perspective |author2=Bhanu Neupane |author3=Lucia Flores Echaiz |author4=Prateek Sibal |author5=Macarena Rivera Lam |publisher=UNESCO Publishing |year=2019 |isbn=978-92-3-100363-9 |page=66}}
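One simple way to surface this kind of skew is to compare group proportions in the labeled training set against the target population. The group names and proportions below are hypothetical, chosen only to echo the imbalance described above.

```python
# Sketch: flag an unrepresentative labeled training set by comparing
# its group proportions with hypothetical population proportions.
from collections import Counter

def proportions(samples):
    """Fraction of samples belonging to each group."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}

# Hypothetical group membership of each labeled training sample.
training_groups = ["lighter"] * 80 + ["darker"] * 20
population = {"lighter": 0.5, "darker": 0.5}

train_props = proportions(training_groups)
skew = {g: train_props.get(g, 0.0) - p for g, p in population.items()}
print(skew)  # a large gap flags an unrepresentative training set
```

A gap of 0.3 for each group, as here, would indicate that models trained on these labels may perform worse on the underrepresented group.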
=== Human error and inconsistency === Human annotators are prone to errors and biases when labeling data, which can produce inconsistent labels and degrade the quality of the data set. Such inconsistency limits a [[machine learning]] model's ability to generalize well.{{Cite journal |last1=Geiger |first1=R. Stuart |last2=Cope |first2=Dominique |last3=Ip |first3=Jamie |last4=Lotosh |first4=Marsha |last5=Shah |first5=Aayush |last6=Weng |first6=Jenny |last7=Tang |first7=Rebekah |date=2021-11-05 |title="Garbage in, garbage out" revisited: What do machine learning application papers report about human-labeled training data? |url=https://direct.mit.edu/qss/article/2/3/795/102771/Garbage-in-garbage-out-revisited-What-do-machine |journal=Quantitative Science Studies |language=en |volume=2 |issue=3 |pages=795–827 |doi=10.1162/qss_a_00144 |doi-access=free|issn=2641-3337|arxiv=2107.02278 }}
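Annotator inconsistency is commonly quantified with inter-annotator agreement statistics such as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The two annotators and their "cat"/"dog" labels below are hypothetical.

```python
# Sketch: measuring label inconsistency between two hypothetical
# annotators with Cohen's kappa (chance-corrected agreement).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement under independent annotators.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["cat", "cat", "dog", "dog", "cat", "dog"]
b = ["cat", "dog", "dog", "dog", "cat", "cat"]
print(round(cohens_kappa(a, b), 3))  # -> 0.333
```

A kappa near 1 indicates consistent labels; values near 0 suggest the annotators agree little more than chance, a warning sign for the resulting data set.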
=== Domain expertise === Certain fields, such as legal document analysis or [[medical imaging]], require annotators with specialized domain knowledge. Without such expertise, the annotations or labeled data may be inaccurate, negatively impacting the machine learning model's performance in real-world scenarios.{{Cite journal |last1=Alzubaidi |first1=Laith |last2=Bai |first2=Jinshuai |last3=Al-Sabaawi |first3=Aiman |last4=Santamaría |first4=Jose |last5=Albahri |first5=A. S. |last6=Al-dabbagh |first6=Bashar Sami Nayyef |last7=Fadhel |first7=Mohammed A. |last8=Manoufali |first8=Mohamed |last9=Zhang |first9=Jinglan |last10=Al-Timemy |first10=Ali H. |last11=Duan |first11=Ye |last12=Abdullah |first12=Amjed |last13=Farhan |first13=Laith |last14=Lu |first14=Yi |last15=Gupta |first15=Ashish |date=2023-04-14 |title=A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications |journal=Journal of Big Data |volume=10 |issue=1 |pages=46 |doi=10.1186/s40537-023-00727-2 |doi-access=free |issn=2196-1115}}
==See also==
- [[Data annotation]]
- [[Humans in the Loop (film)|''Humans in the Loop'' (film)]]
==References== {{Reflist}}
[[Category:Machine learning]]