Identify language | Babel Street

Language Identifier identifies the language or languages of the input text. The endpoint returns the identified languages in descending order of confidence, so the first result is the best match. Language Identifier can also detect different language regions in a multilingual document: when multilingual is set to true, it returns a list of language regions in addition to the whole-document results.

The input data may be in any of 364 language–encoding–script combinations, involving 56 languages, 48 encodings, and 18 writing scripts. The language identifier uses an n-gram algorithm to detect language. Each of the 155 built-in profiles contains the quad-grams (four consecutive bytes) most frequently encountered in documents in a given language, encoding, and script. By default, a profile holds 10,000 n-grams for double-byte encodings and 5,000 for single-byte encodings.

When input text is submitted for detection, a similar n-gram profile is built from that data. The input profile is then compared with every built-in profile by calculating a vector distance between the two. The built-in profiles are returned in ascending order of that distance, so the closest match comes first.

For short strings (140 characters or fewer), the endpoint uses a different, proprietary algorithm, available for all supported languages. The response is an ordered list of identified languages, each with its language code and identification confidence, sorted by descending confidence.
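The profile-and-distance approach described above can be illustrated with a minimal sketch. This is not the endpoint's implementation: the actual profiles and distance measure are not specified here, so the sketch assumes cosine distance over relative quad-gram frequencies, and the function names, toy training sentences, and `top_k` parameter are all hypothetical.

```python
from collections import Counter
import math


def build_profile(data: bytes, n: int = 4, top_k: int = 5000) -> dict:
    """Relative frequencies of the top_k most common byte n-grams in data."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    most_common = counts.most_common(top_k)
    total = sum(count for _, count in most_common)
    return {gram: count / total for gram, count in most_common} if total else {}


def cosine_distance(p: dict, q: dict) -> float:
    """1 - cosine similarity of two sparse frequency vectors (assumed measure)."""
    dot = sum(weight * q.get(gram, 0.0) for gram, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


def rank_profiles(text: str, profiles: dict, encoding: str = "utf-8") -> list:
    """Return (profile name, distance) pairs in ascending order of distance."""
    input_profile = build_profile(text.encode(encoding))
    scored = [(name, cosine_distance(input_profile, profile))
              for name, profile in profiles.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy "built-in" profiles made from one sentence each; real profiles would be
# trained on large corpora per language, encoding, and script.
SAMPLES = {
    "eng": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "spa": "el rápido zorro marrón salta sobre el perro perezoso y luego el perro duerme",
}
PROFILES = {name: build_profile(text.encode("utf-8")) for name, text in SAMPLES.items()}

ranking = rank_profiles("the dog jumps over the fox", PROFILES)
```

Here `ranking` lists the toy profiles closest first, mirroring the ascending-distance ordering described above. Working on bytes rather than characters is what lets a single profile capture language, encoding, and script together.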

Authentication

X-BabelStreetAPI-Key string

API Key authentication via header

X-RosetteAPI-Key string

API Key authentication via header

Request

This endpoint expects an object.

content string Optional

contentUri string Optional

options object Optional
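As a rough illustration of the request object, the sketch below builds a JSON body from the three fields listed above. It assumes that exactly one of content or contentUri should be supplied and that the multilingual flag is passed inside options; neither detail is stated on this page, and the helper name `build_payload` is hypothetical.

```python
import json


def build_payload(content=None, content_uri=None, options=None) -> str:
    """Serialize a request body using the content, contentUri, and options fields."""
    # Assumption: the endpoint wants either inline text or a URI, not both.
    if (content is None) == (content_uri is None):
        raise ValueError("provide exactly one of content or content_uri")
    payload = {}
    if content is not None:
        payload["content"] = content
    if content_uri is not None:
        payload["contentUri"] = content_uri
    if options:
        payload["options"] = options
    return json.dumps(payload, ensure_ascii=False)


# Assumption: multilingual detection is switched on through the options object.
body = build_payload(content="Por favor, señorita.", options={"multilingual": True})
```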

Response

languageDetections list of objects or null

Three-letter ISO 639-3 codes of the detected languages.
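A short sketch of reading the response follows. The sample body is invented for illustration, and the per-detection field names `language` and `confidence` are assumptions based on the description above rather than a confirmed schema. The code also handles the documented case where languageDetections is null.

```python
import json

# Hypothetical response body; field names inside each detection are assumed.
SAMPLE_RESPONSE = """
{"languageDetections": [
  {"language": "spa", "confidence": 0.31},
  {"language": "eng", "confidence": 0.64}
]}
"""


def ranked_detections(response_text: str) -> list:
    """Return detections sorted by descending confidence (empty list if null)."""
    body = json.loads(response_text)
    detections = body.get("languageDetections") or []
    # The endpoint already sorts its results; sorting again is only defensive.
    return sorted(detections, key=lambda d: d.get("confidence", 0.0), reverse=True)


def best_language(response_text: str):
    """Return the ISO 639-3 code of the top detection, or None if there is none."""
    ranked = ranked_detections(response_text)
    return ranked[0]["language"] if ranked else None


print(best_language(SAMPLE_RESPONSE))  # prints "eng"
```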

curl -X POST https://analytics.babelstreet.com/rest/v1/language \
  -H "X-RosetteAPI-Key: <apiKey>" \
  -H "Content-Type: application/json" \
  -d '{
    "contentUri": "https://en.wikipedia.org/wiki/Abd_Rabbuh_Mansur_Hadi"
  }'
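A similar request can be sketched in Python using only the standard library. This is an illustration under assumptions, not an official client: it sends the URL in the contentUri field listed under Request, and the helper names `build_request` and `identify_language` are hypothetical. Replace the API key placeholder with a real key before calling `identify_language`.

```python
import json
import urllib.request

ENDPOINT = "https://analytics.babelstreet.com/rest/v1/language"


def build_request(api_key: str, content_uri: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the language endpoint."""
    # Assumption: the document to analyze is referenced through contentUri.
    payload = json.dumps({"contentUri": content_uri}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=payload,
        headers={
            "X-RosetteAPI-Key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def identify_language(api_key: str, content_uri: str, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_request(api_key, content_uri)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending keeps the network call in one place, which makes the payload and headers easy to inspect before anything leaves the machine.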