I want to use Azure OpenAI but am being asked to provide an OPENAI_APIKEY?

Description

Hi,

I am trying to use Weaviate with the Azure OpenAI service. I have a gpt-4o model deployed there.

I am connecting to the Weaviate Docker container like this:

self.client = weaviate.connect_to_custom(
    http_host=self.config.weaviate_host,
    http_port=self.config.weaviate_port,
    http_secure=False,
    grpc_host=self.config.weaviate_host,
    grpc_port=self.config.weaviate_grpc_port,
    grpc_secure=False,
    headers={
        "X-Azure-Api-Key": self.config.azure_openai_key,
        "X-Azure-Client-Value": self.resource_name
    }
)

My Docker Compose file is as follows:

services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.27.8
    ports:
    - 8080:8080
    - 50051:50051
    volumes:
    - weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      ENABLE_API_BASED_MODULES: 'true'
      CLUSTER_HOSTNAME: 'node1'
      ENABLE_MODULES: 'text2vec-azure-openai'
      AZURE_OPENAI_ENDPOINT: 'https://*****instance.openai.azure.com'
      AZURE_OPENAI_API_KEY: '****'
volumes:
  weaviate_data:

I am creating the collection like this:

def _create_collection(self, resource_name: str):
    """Create the Weaviate collection if it doesn't exist."""
    try:
        try:
            collection = self.client.collections.get(self.collection_name)
            logging.info(f"Using existing collection: {self.collection_name}")
        except weaviate.exceptions.WeaviateQueryError:
            # Collection doesn't exist, create it
            collection = self.client.collections.create(
                name=self.collection_name,
                vectorizer_config=weaviate.classes.config.Configure.Vectorizer.text2vec_azure_openai(
                    vectorizer="text2vec-azure-openai",
                    resource_name=resource_name,
                    deployment_id=self.config.azure_openai_deployment
                ),
                properties=[
                    weaviate.classes.config.Property(
                        name="content",
                        data_type=weaviate.classes.config.DataType.TEXT,
                        description="The chunk content",
                        vectorize=True
                    ),
                    weaviate.classes.config.Property(
                        name="doc_id",
                        data_type=weaviate.classes.config.DataType.TEXT,
                        description="Document identifier"
                    ),
                    weaviate.classes.config.Property(
                        name="chunk_id",
                        data_type=weaviate.classes.config.DataType.INT,
                        description="Chunk number within document"
                    ),
                    weaviate.classes.config.Property(
                        name="source",
                        data_type=weaviate.classes.config.DataType.TEXT,
                        description="Document source"
                    ),
                    weaviate.classes.config.Property(
                        name="last_updated",
                        data_type=weaviate.classes.config.DataType.DATE,
                        description="Last update timestamp"
                    ),
                    weaviate.classes.config.Property(
                        name="content_hash",
                        data_type=weaviate.classes.config.DataType.TEXT,
                        description="Hash of document content"
                    ),
                    weaviate.classes.config.Property(
                        name="file_path",
                        data_type=weaviate.classes.config.DataType.TEXT,
                        description="Original file path"
                    )
                ]
            )
            logging.info(f"Created new collection: {self.collection_name}")

    except Exception as e:
        logging.error(f"Error creating collection: {str(e)}")
        raise

And ingesting documents:

def ingest_document(self, content: str, source: str, file_path: str = None) -> str:
    """Ingest a document into Weaviate."""
    try:
        doc_id = self._generate_doc_id(content, source)
        content_hash = hashlib.md5(content.encode()).hexdigest()

        # Get collection
        collection = self.client.collections.get(self.collection_name)

        # Delete existing chunks if document exists
        try:
            collection.data.delete_many(
                where=weaviate.classes.query.Filter.by_property("doc_id").equal(doc_id)
            )
        except Exception as e:
            logging.warning(f"Error deleting existing chunks: {str(e)}")

        # Create new chunks
        chunks = self._chunk_document(content)
        current_time = datetime.now(timezone.utc).isoformat()

        # Prepare objects for batch import
        objects = []
        for i, chunk in enumerate(chunks):
            properties = {
                "content": chunk,
                "doc_id": doc_id,
                "chunk_id": i,
                "source": source,
                "last_updated": current_time,
                "content_hash": content_hash
            }

            if file_path:
                properties["file_path"] = file_path

            objects.append(properties)

        # Import all chunks in a single batch
        if objects:
            collection.data.insert_many(objects)

        return doc_id

    except Exception as e:
        logging.error(f"Error ingesting document: {str(e)}")
        raise

And yet, I am getting the following error:

2024-12-16 09:45:34,854 - ERROR - Error processing C:\Projects\Qualification Toolbox\backend\documents\technical qualification-v30-20241202_045512.pdf: Every object failed during insertion. Here is the set of all errors: API Key: no api key found neither in request header: X-Openai-Api-Key nor in environment variable under OPENAI_APIKEY
Processing existing documents: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 8/8 [00:00<00:00, 12.53it/s] 
2024-12-16 09:45:34,872 - ERROR - Error querying similar chunks: Query call with protocol GRPC search failed with message <AioRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "explorer: get class: concurrentTargetVectorSearch): explorer: get class: vectorize search vector: vectorize params: vectorize params: vectorize keywords: remote client vectorize: API Key: no api key found neither in request header: X-Openai-Api-Key nor in environment variable under OPENAI_APIKEY"
        debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-12-15T22:45:34.8592749+00:00", grpc_status:2, grpc_message:"explorer: get class: concurrentTargetVectorSearch): explorer: get class: vectorize search vector: vectorize params: vectorize params: vectorize keywords: remote client vectorize: API Key: no api key found neither in request header: X-Openai-Api-Key nor in environment variable under OPENAI_APIKEY"}"
>.

Why am I being asked to provide an OPENAI_APIKEY?

Server Setup Information

  • Weaviate Server Version: 1.27.8
  • Deployment Method: official docker image
  • Multi Node? Number of Running Nodes: No, single node
  • Client Language and Version: Python, weaviate-client==4.9.6
  • Multitenancy?: No

Hi @fcaldas,

Can you try providing the base_url when you define the vectorizer?

Like this:

from weaviate.classes.config import Configure

client.collections.create(
    name="Jeopardy",
    vectorizer_config=Configure.Vectorizer.text2vec_azure_openai(
        deployment_id="text-embedding-3-small",  # your embedding model deployment
        resource_name=AZURE_RESOURCE_NAME,
        base_url=AZURE_BASE_URL
    ),
)
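
Here, AZURE_BASE_URL is the endpoint of your Azure OpenAI resource, i.e. the same value you already set as AZURE_OPENAI_ENDPOINT in your compose file, so the module knows which endpoint to call. With placeholder values:

AZURE_RESOURCE_NAME = "my-instance"                      # your Azure OpenAI resource name
AZURE_BASE_URL = "https://my-instance.openai.azure.com"  # your Azure OpenAI endpoint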

Also, remove the vectorizer argument; there is no vectorizer parameter in text2vec_azure_openai().


Thank you!

That makes sense. I ended up finding out that I don’t have access to the Azure OpenAI embedding models in my Azure OpenAI subscription, so I had to change direction and use sentence-transformers/all-MiniLM-L6-v2 instead.

So I am creating the collection like this:

def _create_new_collection(self):
    """Create a new collection with proper configuration."""
    return self.client.collections.create(
        name=self.collection_name,
        vectorizer_config=weaviate.classes.config.Configure.Vectorizer.none(),
        vector_index_config=weaviate.classes.config.Configure.VectorIndex.hnsw(
            distance_metric=weaviate.classes.config.VectorDistances.COSINE,
            vector_cache_max_objects=1000000,
            max_connections=64,
            ef_construction=128,
            ef=100,
            dynamic_ef_min=100,
            dynamic_ef_max=500,
            dynamic_ef_factor=8,
            flat_search_cutoff=40000,
            cleanup_interval_seconds=300
        ),
        properties=[
            weaviate.classes.config.Property(
                name="content",
                data_type=weaviate.classes.config.DataType.TEXT,
                description="The chunk content",
                vectorize=True
            ),
            weaviate.classes.config.Property(
                name="documentId",
                data_type=weaviate.classes.config.DataType.TEXT,
                description="Document identifier"
            ),
            weaviate.classes.config.Property(
                name="chunkId",
                data_type=weaviate.classes.config.DataType.INT,
                description="Chunk number within document"
            ),
            weaviate.classes.config.Property(
                name="source",
                data_type=weaviate.classes.config.DataType.TEXT,
                description="Document source"
            ),
            weaviate.classes.config.Property(
                name="lastUpdated",
                data_type=weaviate.classes.config.DataType.DATE,
                description="Last update timestamp"
            ),
            weaviate.classes.config.Property(
                name="contentHash",
                data_type=weaviate.classes.config.DataType.TEXT,
                description="Hash of document content"
            ),
            weaviate.classes.config.Property(
                name="filePath",
                data_type=weaviate.classes.config.DataType.TEXT,
                description="Original file path"
            )
        ]
    )

And ingesting the documents like this:

def ingest_document(self, content: str, source: str, file_path: str = None) -> str:
    """Ingest a document into Weaviate."""
    try:
        doc_id = self._generate_doc_id(content, source)
        content_hash = hashlib.md5(content.encode()).hexdigest()

        # Get collection
        collection = self.client.collections.get(self.collection_name)

        # Delete existing chunks if document exists
        try:
            collection.data.delete_many(
                where=weaviate.classes.query.Filter.by_property("documentId").equal(doc_id)
            )
            logging.info(f"Deleted existing chunks for document {doc_id}")
        except Exception as e:
            if "not found" not in str(e).lower():
                logging.warning(f"Error deleting existing chunks: {str(e)}")

        # Create new chunks
        chunks = self._chunk_document(content)
        current_time = datetime.now(timezone.utc).isoformat()

        # Prepare objects for batch import
        objects_to_create = []
        for i, chunk in enumerate(chunks):
            # Generate vector for the chunk
            vector = self._generate_embedding(chunk)

            properties = {
                "content": chunk,
                "documentId": doc_id,
                "chunkId": i,
                "source": source,
                "lastUpdated": current_time,
                "contentHash": content_hash
            }

            if file_path:
                properties["filePath"] = file_path

            # Create object with vector
            objects_to_create.append({
                "properties": properties,
                "vector": vector
            })

        # Import objects in batches
        batch_size = 100
        for i in range(0, len(objects_to_create), batch_size):
            batch = objects_to_create[i:i + batch_size]
            try:
                # Use batch import
                with collection.batch.dynamic() as batch_writer:
                    for obj in batch:
                        batch_writer.add_object(
                            properties=obj["properties"],
                            vector=obj["vector"]
                        )
                logging.info(f"Successfully inserted batch of {len(batch)} chunks for document {doc_id}")
            except Exception as e:
                logging.error(f"Error inserting batch: {str(e)}")
                raise

        return doc_id

    except Exception as e:
        logging.error(f"Error ingesting document: {str(e)}")
        raise
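
For reference, the _generate_embedding helper used above is just a thin wrapper around the sentence-transformers model. A minimal sketch of what it can look like (assuming the sentence-transformers package is installed):

from sentence_transformers import SentenceTransformer

# Load the model once and reuse it; all-MiniLM-L6-v2 produces 384-dimensional vectors
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def generate_embedding(text: str) -> list:
    """Encode a chunk of text into a vector for Weaviate."""
    # encode() returns a numpy array; the client expects a plain list of floats
    return model.encode(text).tolist()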

This all works: I am able to connect to Weaviate, ingest the documents in batches, and send my prompt with the relevant chunks to Azure OpenAI.

That said, the answers I am getting from Azure OpenAI are pretty average, and it doesn’t look like the model is using the knowledge from the internal documents I’ve passed in as relevant chunks. This is what I need to figure out next.

Cheers

Can you share an example of how you query your data?
This might help us figure out if you could change something about your queries.
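
For example, since the collection is configured with Vectorizer.none(), every query has to bring its own vector, embedded with the same model you used at ingest time. A minimal sketch (the collection name and query text are placeholders):

from weaviate.classes.query import MetadataQuery

collection = client.collections.get("Documents")  # placeholder collection name

# Embed the question with the same sentence-transformers model used for the chunks
query_vector = generate_embedding("How does the qualification process work?")

response = collection.query.near_vector(
    near_vector=query_vector,
    limit=5,
    return_metadata=MetadataQuery(distance=True),
)

for obj in response.objects:
    print(obj.metadata.distance, obj.properties["content"])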

By the way, most Wednesdays we run Office Hours, during which you can ask questions and get help from our experts, which could help you get this solved :wink:

You can register here, and we will send you a calendar invite with a link to the meeting :wink:

We also run workshops and other events; you can learn more here.