Merged Master to React-Django Branch and Added Area Text Content #5

Merged
merged 3 commits into from
Aug 28, 2024
32 changes: 29 additions & 3 deletions frontend/components/Datasets.tsx
@@ -15,6 +15,7 @@ import {
SkeletonCircle,
SkeletonText,
Link,
+ Image as ChakraImage,
} from "@chakra-ui/react";
import Image from "next/image";
import axios from "axios";
@@ -127,19 +128,44 @@ export default function Datasets() {
>
<Box
position={"relative"}
height={"300px"}
rounded={"2xl"}
boxShadow={"2xl"}
width={"full"}
overflow={"hidden"}
>
- <Image
+ <ChakraImage
alt={"Hero Image"}
fill
src={`${imagePrefix}/assets/data-collection.png`}
/>
</Box>
</Flex>
+ <Text>
+ Early on in our journey, we recognized that advancing Indian
+ technology necessitates large-scale datasets. Thus, building and
+ collecting extensive datasets across multiple verticals has become a
+ critical endeavor at AI4Bharat. Thanks to generous grants from
+ MeitY, we are spearheading pioneering efforts in data collection as
+ part of the Data Management Unit of Bhashini. Our nationwide
+ initiative aims to gather 15,000 hours of transcribed data from over
+ 400 districts, encompassing all 22 scheduled languages of India. In
+ parallel, our in-house team of over 100 translators is diligently
+ creating a parallel corpus with 2.2 million translation pairs across
+ 22 languages. To produce studio-quality data for expressive TTS
+ systems, we have established recording studios in our lab, where
+ professional voice artists contribute their expertise. Additionally,
+ our annotators are meticulously labeling pages for Document Layout
+ Parsing, accommodating the diverse scripts of India. To accelerate
+ the development of Indic Large Language Models (LLMs), we are
+ focused on building pipelines for curating and synthetically
+ generating pre-training data, collecting contextually grounded
+ prompts, and creating evaluation datasets that reflect India’s rich
+ linguistic tapestry. Collecting and annotating data at this scale
+ demands standardization of processes and tools. To meet this
+ challenge, AI4Bharat has invested in developing various open-source
+ data collection and annotation tools, aiming to enhance these
+ efforts not only within India but also in multilingual regions
+ across the globe.
+ </Text>
</Stack>
{isLoading ? (
<SimpleGrid columns={{ base: 1, md: 3 }} spacing={10}>
17 changes: 16 additions & 1 deletion frontend/components/Dynamic/Area.tsx
@@ -23,7 +23,7 @@ const areaInfo: { [key: string]: { title: string; description: string } } = {
nmt: {
title: "Machine Translation",
description:
- "AI4Bharat is a pioneering initiative focused on building open-source AI solutions that address challenges unique to India. One of their significant contributions is in the field of machine translation, where they aim to bridge the linguistic diversity of the country. AI4Bharat has developed state-of-the-art models that facilitate the translation of text between Indian languages, enabling seamless communication across different linguistic communities. Their work includes creating large-scale datasets, fine-tuning models for regional languages, and ensuring these tools are accessible to developers and researchers. This initiative not only promotes inclusivity but also helps preserve the rich linguistic heritage of India by making digital content available in multiple languages.",
+ "Our machine translation models, including IndicTransv2, are built on large-scale datasets mined from the web and carefully curated human translations, catering to all 22 Indian languages and competing with commercial models as validated on multiple benchmarks.",
},
llm: {
title: "Large Language Models",
@@ -40,6 +40,21 @@ const areaInfo: { [key: string]: { title: string; description: string } } = {
models, while ensuring diversity in their generation capabilities, thereby advancing the frontier of
language technology for India’s diverse linguistic landscape.`,
},
+ asr: {
+ title: "Automatic Speech Recognition",
+ description:
+ "Our ASR models, including IndicWav2Vec and IndicWhisper, are trained on rich datasets like Kathbath, Shrutilipi and IndicVoices, covering multiple Indian languages.",
+ },
+ tts: {
+ title: "Speech Synthesis",
+ description:
+ "AI4Bharat’s TTS efforts, exemplified by AI4BTTS, focus on creating natural-sounding synthetic voices for Indian languages using a mix of web-crawled data and carefully curated datasets like Rasa.",
+ },
+ xlit: {
+ title: "Transliteration",
+ description:
+ "AI4Bharat’s transliteration models, like IndicXlit, are optimized for converting text between scripts of Indian languages and English, leveraging large-scale datasets such as Aksharantar.",
+ },
};

const fetchAreaData = async (slug: string) => {
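The `areaInfo` map in this hunk is keyed by area slug (`nmt`, `llm`, `asr`, `tts`, `xlit`). As a rough sketch of how a page could resolve a slug against such a map without crashing on an unknown slug, the snippet below uses a hypothetical `getAreaInfo` helper and a hypothetical `FALLBACK_AREA` placeholder. Neither is part of this PR, and the descriptions are replaced with stand-in strings rather than the real copy.

```typescript
// Illustrative sketch only: getAreaInfo and FALLBACK_AREA are hypothetical
// names, not code from this PR.
type AreaInfo = { title: string; description: string };

// Abbreviated stand-in for the areaInfo map in Area.tsx. Titles match the
// diff; descriptions are placeholders, not the real text.
const areaInfo: Record<string, AreaInfo> = {
  nmt: { title: "Machine Translation", description: "(see Area.tsx)" },
  llm: { title: "Large Language Models", description: "(see Area.tsx)" },
  asr: { title: "Automatic Speech Recognition", description: "(see Area.tsx)" },
  tts: { title: "Speech Synthesis", description: "(see Area.tsx)" },
  xlit: { title: "Transliteration", description: "(see Area.tsx)" },
};

// Returned when the slug is not one of the known areas.
const FALLBACK_AREA: AreaInfo = { title: "Unknown area", description: "" };

function getAreaInfo(slug: string): AreaInfo {
  // Check own properties only, so slugs such as "toString" do not match
  // methods inherited from Object.prototype.
  const hit = Object.prototype.hasOwnProperty.call(areaInfo, slug)
    ? areaInfo[slug]
    : undefined;
  return hit ?? FALLBACK_AREA;
}

console.log(getAreaInfo("asr").title);
console.log(getAreaInfo("not-an-area").title);
```

A guard like this matters mainly because the slug comes from the URL (`/areas/<slug>`), so any string can reach the lookup.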
8 changes: 4 additions & 4 deletions frontend/components/Features.tsx
@@ -103,7 +103,7 @@ export default function Features() {
/>
}
description={
- "AI4Bharat has pioneered the development of multilingual LLMs tailored for Indian languages, such as IndicBERT, IndicBART, and Airavata trained on extensive, diverse datasets like IndicCorpora and Sangraha."
+ "Our machine translation models, including IndicTransv2, are built on large-scale datasets mined from the web and carefully curated human translations, catering to all 22 Indian languages and competing with commercial models as validated on multiple benchmarks."
}
href={`${imagePrefix}/areas/nmt`}
/>
@@ -118,7 +118,7 @@ export default function Features() {
/>
}
description={
- "AI4Bharat has pioneered the development of multilingual LLMs tailored for Indian languages, such as IndicBERT, IndicBART, and Airavata trained on extensive, diverse datasets like IndicCorpora and Sangraha."
+ "AI4Bharat’s transliteration models, like IndicXlit, are optimized for converting text between scripts of Indian languages and English, leveraging large-scale datasets such as Aksharantar."
}
href={`${imagePrefix}/areas/xlit`}
/>
@@ -133,7 +133,7 @@ export default function Features() {
/>
}
description={
- "AI4Bharat has pioneered the development of multilingual LLMs tailored for Indian languages, such as IndicBERT, IndicBART, and Airavata trained on extensive, diverse datasets like IndicCorpora and Sangraha."
+ "Our ASR models, including IndicWav2Vec and IndicWhisper, are trained on rich datasets like Kathbath, Shrutilipi and IndicVoices, covering multiple Indian languages."
}
href={`${imagePrefix}/areas/asr`}
/>
@@ -148,7 +148,7 @@ export default function Features() {
/>
}
description={
- "AI4Bharat has pioneered the development of multilingual LLMs tailored for Indian languages, such as IndicBERT, IndicBART, and Airavata trained on extensive, diverse datasets like IndicCorpora and Sangraha."
+ "AI4Bharat’s TTS efforts, exemplified by AI4BTTS, focus on creating natural-sounding synthetic voices for Indian languages using a mix of web-crawled data and carefully curated datasets like Rasa."
}
href={`${imagePrefix}/areas/tts`}
/>
101 changes: 101 additions & 0 deletions frontend/package-lock.json

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions frontend/package.json
@@ -23,6 +23,7 @@
"markdown-to-jsx": "^7.5.0",
"next": "14.2.5",
"react": "^18",
+ "react-audio-voice-recorder": "^2.2.0",
"react-dom": "^18",
"react-icons": "^5.3.0",
"react-markdown": "^9.0.1",